Introducing EBR-bench, our new benchmark to measure on-the-fly learning.
AI repeatedly plays a challenging board game called Earthborne Rangers and tries to learn from its mistakes. So far: no signs of improvement.
Epoch AI just dropped EBR-bench, a new test that runs AI models through repeated sessions of the intricate board game Earthborne Rangers to check whether they can study past errors and actually get better at strategy over time.
No current system shows measurable gains despite feedback and multiple tries, leaving open how soon any model might cross into genuine on-the-fly adaptation.
By focusing on continual learning inside a rich game environment, the release gives researchers a concrete way to track whether recursive self-improvement is inching closer or still stalled.
Users praise Epoch AI's EBR-bench as a cool and reliable test of continual learning in board games, because it resists rigging and highlights genuine AI limitations.
Continual learning is probably the biggest barrier to explosive AI adoption (& may have big implications for recursive self-improvement as well)
As long as you deal with amnesiac models that require humans to do the learning for them, adoption will be gated by human processes.
this feels ARC-AGI-3 coded
it's also aiming at continual learning
again, I wonder how much you can do with decent scaffolding
that is pretty rough: no significant improvements even when given guides
Even if we give them a full strategy guide—the best set of notes we think they could take—models improve only modestly and still show no ability to get better with practice.

Have we under-elicited AI’s true capabilities? In the future, we plan to experiment with providing more tools (web search, code execution), trying different scaffolds, using multi-agent setups, and providing expert human playthrough transcripts. Let us know your ideas here!

Models struggle with tactics. The game’s core damage mechanic is called “fatigue”, and taking too much fatigue is a sign of managing turn-by-turn play poorly. Models do better than random, but fall short of expert human performance.

Read more about the benchmark on our website.
https://epoch.ai/publications/earthborne-rangers-benchmark

If AI can learn on the fly, it becomes much more general-purpose. This has economic implications (learning on the job) as well as safety consequences (developing dangerous capabilities post-release). We study the ability to learn an unfamiliar game as a proxy for this dynamic.

For this, we use Earthborne Rangers: a somewhat obscure, largely text-based campaign game. It requires a mix of strategic deck-building and tactical turn-by-turn play. A single playthrough takes humans 2–4 hours, and mastery may require dozens of playthroughs.

AI could likely get better at EBR with focused RL training, and we suspect that AI companies have just not prioritized such tasks. So long as this remains the case, EBR-bench serves as a tool to detect the emergence of on-the-fly learning.

Mostly, we didn't get enough samples for us to feel confident standing behind any of our scaling trends; they were just too noisy. But there were some interesting signs of per-decision inference scaling having extremely limited or even negative effects (below log-linear).

AI systems play the game repeatedly. They are given the rulebook, a card database, and the game’s map. They have a note-taking tool that persists across compactions. Their task is to maximize their score on the final 20% of playthroughs. We see no on-the-fly learning.
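
To make the "final 20% of playthroughs" target concrete, here is a rough sketch of how such a score could be computed and compared against early play. The thread only says the final 20% is what counts; averaging the scores, comparing against the first 20%, and the function names and numbers below are all assumptions for illustration, not Epoch AI's actual scoring code.

```python
from statistics import mean


def final_window_score(scores, fraction=0.2):
    """Mean score over the last `fraction` of playthroughs (at least one game).

    Using the mean as the aggregate is an assumption; the source only says
    the final 20% of playthroughs is what gets scored.
    """
    if not scores:
        raise ValueError("need at least one playthrough score")
    n_final = max(1, round(len(scores) * fraction))
    return mean(scores[-n_final:])


def learning_gain(scores, fraction=0.2):
    """Final-window mean minus first-window mean (hypothetical helper).

    A value near zero is what "no on-the-fly learning" would look like.
    """
    if not scores:
        raise ValueError("need at least one playthrough score")
    n = max(1, round(len(scores) * fraction))
    return mean(scores[-n:]) - mean(scores[:n])


# Made-up scores for ten playthroughs, in play order (not benchmark data).
example_scores = [12, 9, 11, 10, 13, 10, 12, 9, 11, 12]
print(final_window_score(example_scores))
print(learning_gain(example_scores))
```

With noisy scores like these, a single gain number says little on its own; many runs would be needed before reading it as evidence either way.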

Baseline performance has improved somewhat with newer generations of models. GPT-5.5 and Opus 4.8 clearly outscore GPT-5 and Opus 4.1, though progress since is less obvious. In any case, this comes from better out-of-the-box performance, not from on-the-fly learning.

Models also struggle with strategy. A major aspect of this is deck-building, where the player chooses their initial cards. There are 32 “archetypes” of deck but models explore only a fraction of them. Many models stick to a single archetype in all their exploratory playthroughs.
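
One way to put a number on how narrowly a model explores is the Shannon entropy of its archetype choices, alongside the share of the 32 archetypes it ever tries. This is only a sketch: the thread does not say how exploration was measured, and the archetype labels and helper names here are hypothetical.

```python
from collections import Counter
from math import log2

NUM_ARCHETYPES = 32  # number of deck archetypes stated in the thread


def archetype_entropy(choices):
    """Shannon entropy (in bits) of the archetypes picked across playthroughs.

    0 bits means the same archetype every time; log2(32) = 5 bits would be
    a uniform spread over all archetypes.
    """
    if not choices:
        raise ValueError("need at least one archetype choice")
    total = len(choices)
    counts = Counter(choices)
    return sum((c / total) * log2(total / c) for c in counts.values())


def archetype_coverage(choices, num_archetypes=NUM_ARCHETYPES):
    """Fraction of all archetypes tried at least once."""
    return len(set(choices)) / num_archetypes


# Hypothetical archetype labels, made up for illustration.
stuck = ["archetype_a"] * 8
varied = ["archetype_a", "archetype_b", "archetype_c", "archetype_d"]

print(archetype_entropy(stuck), archetype_coverage(stuck))
print(archetype_entropy(varied), archetype_coverage(varied))
```

A model that sticks to one archetype in every exploratory playthrough would sit at zero entropy no matter how many games it plays.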

We also didn't talk about other sub-metrics besides fatigue because the article was running long. "Injuries" are the other way the game attacks you, and agent performance on that front is even worse than on fatigue; their play is very reckless and sloppy, and they even seem to...

@simonrowland @EpochAIResearch They're allowed to write plaintext notes like how MEMORY.md works with coding agents, but no RAG.
@emollick 💯

...convince themselves that injuries have upside sometimes!? We also tracked the locations they visited, and they showed about as much entropy collapse there as they did with deckbuilding, always sticking to the same bad route instead of exploring to find better routes.
also check out this thread:
Alright, time for my personal thread of all the juicy stuff that didn't make it into the main article!