Epoch AI releases EBR-bench, finding current models fail to adapt and improve over repeated games

Story Overview

Epoch AI has released EBR-bench, a benchmark that has AI models play the intricate board game Earthborne Rangers over repeated sessions to test whether they can learn from past mistakes and improve their strategy over time.

Original post

Epoch AI@EpochAIResearch

Introducing EBR-bench, our new benchmark to measure on-the-fly learning.

AI repeatedly plays a challenging board game called Earthborne Rangers and tries to learn from its mistakes. So far: no signs of improvement.

9:09 AM · Jul 2, 2026 · 37.2K Views

Open Question

What the early runs reveal about model limits

No current system shows measurable gains despite feedback and multiple tries, leaving open how soon any model might cross into genuine on-the-fly adaptation.

Research Watch

Why this benchmark could shape future progress checks

By focusing on continual learning inside a rich game environment, the release gives researchers a concrete way to track whether on-the-fly learning, which may bear on recursive self-improvement, is emerging or still stalled.

Sentiment

Users praise Epoch AI's EBR-bench as a cool and reliable test of continual learning in board games, saying it resists rigging and highlights genuine AI limitations.

Positive: 100.0%

Negative: 0.0%

4 comments with sentiment.

Posts from X

Ethan Mollick@emollick

Continual learning is probably the biggest barrier to explosive AI adoption (& may have big implications for recursive self-improvement as well)

As long as you deal with amnesiac models that require humans to do the learning for them, adoption will be gated by human processes.

Lisan al Gaib@scaling01

this feels ARC-AGI-3 coded

it's also aiming at continual learning

again, I wonder how much you can do with decent scaffolding

Lisan al Gaib@scaling01

that is pretty rough: no significant improvements even when given guides

Epoch AI@EpochAIResearch

Even if we give them a full strategy guide—the best set of notes we think they could take—models improve only modestly and still show no ability to get better with practice.

Epoch AI@EpochAIResearch

Have we under-elicited AI’s true capabilities? In the future, we plan to experiment with providing more tools (web search, code execution), trying different scaffolds, using multi-agent setups, and providing expert human playthrough transcripts. Let us know your ideas here!

Epoch AI@EpochAIResearch

Models struggle with tactics. The game’s core damage mechanic is called “fatigue”, and taking too much fatigue is a sign of managing turn-by-turn play poorly. Models do better than random, but fall short of expert human performance.

Epoch AI@EpochAIResearch

Read more about the benchmark on our website.

https://epoch.ai/publications/earthborne-rangers-benchmark

Epoch AI@EpochAIResearch

If AI can learn on the fly, it becomes much more general-purpose. This has economic implications (learning on the job) as well as safety consequences (developing dangerous capabilities post-release). We study the ability to learn an unfamiliar game as a proxy for this dynamic.

Epoch AI@EpochAIResearch

For this, we use Earthborne Rangers: a somewhat obscure, largely text-based campaign game. It requires a mix of strategic deck-building and tactical turn-by-turn play. A single playthrough takes humans 2–4 hours, and mastery may require dozens of playthroughs.

Epoch AI@EpochAIResearch

AI could likely get better at EBR with focused RL training, and we suspect that AI companies have just not prioritized such tasks. So long as this remains the case, EBR-bench serves as a tool to detect the emergence of on-the-fly learning.

Benjamin Ou@AlephNuul

Mostly, we didn't get enough samples for us to feel confident standing behind any of our scaling trends; they were just too noisy. But there were some interesting signs of per-decision inference scaling having extremely limited or even negative effects (below log-linear).

Epoch AI@EpochAIResearch

AI systems play the game repeatedly. They are given the rulebook, a card database, and the game’s map. They have a note-taking tool that persists across compactions. Their task is to maximize their score on the final 20% of playthroughs. We see no on-the-fly learning.
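
The "final 20% of playthroughs" objective described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post does not say how scores are aggregated, so the mean, the rounding of the window size, and the names `final_window_score` and `learning_gain` are all hypothetical, not Epoch AI's implementation.

```python
from statistics import mean


def _window_size(n_playthroughs, fraction):
    """Number of playthroughs in the window; always at least one (assumed rule)."""
    return max(1, round(n_playthroughs * fraction))


def final_window_score(scores, fraction=0.2):
    """Mean score over the last `fraction` of playthroughs, in play order."""
    if not scores:
        raise ValueError("need at least one playthrough score")
    window = _window_size(len(scores), fraction)
    return mean(scores[-window:])


def learning_gain(scores, fraction=0.2):
    """Final-window mean minus first-window mean.

    A value near zero is what "no on-the-fly learning" would look like
    under this (assumed) way of measuring it.
    """
    if not scores:
        raise ValueError("need at least one playthrough score")
    window = _window_size(len(scores), fraction)
    return mean(scores[-window:]) - mean(scores[:window])


# Made-up scores for ten playthroughs, purely for illustration.
example_scores = [10, 12, 11, 13, 12, 11, 12, 13, 11, 12]
print(final_window_score(example_scores))
print(learning_gain(example_scores))
```

Comparing the last window against the first is one simple way to separate out-of-the-box ability from improvement with practice; a real analysis would also need many runs per model to deal with noise.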

Epoch AI@EpochAIResearch

Baseline performance has improved somewhat with newer generations of models. GPT-5.5 and Opus 4.8 clearly outscore GPT-5 and Opus 4.1, though progress since is less obvious. In any case, this comes from better out-of-the-box performance, not from on-the-fly learning.

Epoch AI@EpochAIResearch

Models also struggle with strategy. A major aspect of this is deck-building, where the player chooses their initial cards. There are 32 “archetypes” of deck but models explore only a fraction of them. Many models stick to a single archetype in all their exploratory playthroughs.

Benjamin Ou@AlephNuul

We also didn't talk about other sub-metrics besides fatigue because the article was running long. "Injuries" are the other way the game attacks you, and agent performance on that front is even worse than on fatigue; their play is very reckless and sloppy, and they even seem to...

Benjamin Ou@AlephNuul

@simonrowland @EpochAIResearch They're allowed to write plaintext notes like how MEMORY.md works with coding agents, but no RAG.

Taelin@VictorTaelin

@emollick 💯

Benjamin Ou@AlephNuul

...convince themselves that injuries have upside sometimes!? We also tracked the locations they visited, and they showed about as much entropy collapse there as they did with deckbuilding, always sticking to the same bad route instead of exploring to find better routes.
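
One common way to quantify the "entropy collapse" mentioned here is the Shannon entropy of the choices made across playthroughs. The sketch below is an assumption about how such a number could be computed, not the method used in the article; the function name `choice_entropy_bits` and the archetype labels are hypothetical.

```python
import math
from collections import Counter


def choice_entropy_bits(choices):
    """Shannon entropy, in bits, of the empirical distribution of choices."""
    if not choices:
        raise ValueError("need at least one choice")
    total = len(choices)
    counts = Counter(choices)
    # Sum p * log2(1/p) over the options that were actually picked.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical archetype labels, one per exploratory playthrough.
collapsed = ["archetype_A"] * 8
varied = ["archetype_A", "archetype_B", "archetype_C", "archetype_D"] * 2

print(choice_entropy_bits(collapsed))  # a single archetype every time
print(choice_entropy_bits(varied))     # four archetypes, evenly used
print(math.log2(32))                   # upper bound with 32 archetypes
```

An agent that always picks the same deck or route scores 0 bits, while uniform exploration of all 32 archetypes would reach the 5-bit maximum, so the gap between the two gives a rough sense of how little of the option space was explored.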

Lisan al Gaib@scaling01

also check out this thread:

Benjamin Ou@AlephNuul

Alright, time for my personal thread of all the juicy stuff that didn't make it into the main article!