Tech · 4h ago

Anthropic's Fable 5 achieves a 10x performance gain on Thoughtful Lab’s FrogsGame by autonomously training a weaker model

AI Judge changed the title after evaluation. Original title: "Fable 5 achieves a 10x performance gain on Thoughtful Lab's FrogsGame benchmark by autonomously training a weaker model"

Other tested frontier models averaged under 4% on the benchmark.

#265

Original post

Thoughtful@thoughtfullab

Fable 5 is doing something wild on our FrogsGame post-training task.

It trains a weaker model to solve the puzzle, peaks at 68%, and produces the only ~10x improvement we see across the benchmark.

It spent 17 hours and 25M tokens without a human in sight. 34% pass@1, while every other frontier model averages under 4%.

We will publish a more detailed analysis soon.

Thoughtful@thoughtfullab

Model shaping is still a craft of a few. That's what AI agents are for: learning it and doing it for everyone else.

As part of the FrontierSWE benchmark we built a 20-hour post-training task on @tinkerapi and found the real bottleneck is research intuition.

3:23 PM · Jun 11, 2026 · 145.3K Views

Sentiment

Positive users express excitement over Fable 5's large gains on the FrogsGame benchmark and its research potential, while negative users call the results boring and unusable and harshly criticize Anthropic's approach.

Pos

46.4%

Neg

53.6%

17 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 62.5K · Bookmarks: 108 · Likes: 407

Lisan al Gaib@scaling01

and this is why Anthropic restricted LLM development for Fable 5

Thoughtful@thoughtfullab

3h · 62.5K views · 407 likes · 108 bookmarks

Retweets: 13 · Replies: 14

Andrew Curran@AndrewCurran_

Once in a while you see charts like this for Mythos, and now for Fable. The entire game changed in February when Mythos emerged from training. Imagine what Anthropic has developed internally since then. I think they are on a completely different - invisible - trajectory now.

Thoughtful@thoughtfullab

2h21.9K26243

Karina@karinanguyen

the heart attack continues w/ Fable 5, we checked there were no reward hacks

Thoughtful@thoughtfullab

4h37.1K22175

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

I wouldn't read too much into this tbh, I'm kind of negging OpenAI. They were getting too smug for a while; now they're paying for it. But they remain the OG AGI lab. They had two such moments in the history of commercial LLMs: GPT-4 and o1. Dario had *nothing* on o1 either.

Andrew Curran@AndrewCurran_

2h5.1K7811

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

The funniest outcome would be if Gemini finally converts their strategic bet (starting with XLand and onwards to Nano-Banana, Veo, Genie 4…) into a performance leap. Dario: scale. Sama: reasoning. Demis: world modeling. It's not just a FLOPS arms race, it's a paradigms clash.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

2h1K233

Andrew Curran@AndrewCurran_

Karina@karinanguyen

2h2.5K101

Andrew Curran@AndrewCurran_

@fireandvision From Anthropic's Series C in 2023:

2h9361

ThisIsIsaac@dog_foot_ruler_

@thoughtfullab How were you not redirected to opus 4.8? Did you use special prompting to bypass Anthropic’s guardrail against llm research work when using fable5?

2h1.5K6

Wondermonger@fireandvision

@AndrewCurran_ "2026–2027 is the critical window in AI. If you're ahead then, the models start getting better than humans at everything, including AI design and using AI to make better AI."

2h1024

Josh@JoshPurtell

@AndrewCurran_ You’re getting out of your depth. It found a clever pseudo reward hack for that Env that doesn’t generalize

2h1422

Caitlin Kalinowski@kalinowski007

Damn.

Karina@karinanguyen

1h90000

Thoughtful@thoughtfullab

That said, we dislike FrogsGame as a task internally. The frogs know what they did. We're now sprinting toward adding more useful, real-world posttraining tasks, partly out of ambition, partly to put a distance between us and the frogs 🐸

1h4684

Potrock@Potrock_

@JoshPurtell @AndrewCurran_

2h791

Josh@JoshPurtell

@Potrock_ @AndrewCurran_ Pseudo is the key word here

2h351

Potrock@Potrock_

@JoshPurtell @AndrewCurran_ They looked, they validated, but surely you are right. how many pseudos turn into 1 valid solution?

2h271

Josh@JoshPurtell

@Potrock_ @AndrewCurran_ Idk what that means but it just wrote a solution for the env in python and then used STaR to turn it into finetuning data. That’s technically legal but doesn’t generalize

2h171

Solgato@Tigger0000

@AndrewCurran_ slaughterbots

1h101

Orion (e/acc)@SerendipitousOr

@scaling01 recursive self improvement soon

3h943

TheTinman@NguyenTinMan

@dog_foot_ruler_ @thoughtfullab That's only going forward. Since release, "frontier llm development" requests would make Fable give a "dumbed down" response. Which apparently is still much better than 4.8 or this wasn't considered frontier/sota

2h2552

Huck111@Huck1112

@scaling01 But we need this. This would lead to more research being done. I don't understand why it is restricted. It is what i wanted to use it most for

2h1692