/Tech · 1h ago

Fable 5 achieves a 10x performance gain on Thoughtful Lab's FrogsGame benchmark by autonomously training a weaker model

The 17-hour autonomous run consumed 25 million tokens

#265

Original post

Thoughtful@thoughtfullab

Fable 5 is doing something wild on our FrogsGame post-training task.

It trains a weaker model to solve the puzzle, peaks at 68%, and produces the only ~10x improvement we see across the benchmark.

It spent 17 hours and 25M tokens without a human in sight. 34% pass@1, while every other frontier model averages under 4%.

We will publish a more detailed analysis soon.

Thoughtful@thoughtfullab

Model shaping is still a craft of a few. That's what AI agents are for: learning it and doing it for everyone else.

As part of the FrontierSWE benchmark, we built a 20-hour post-training task on @tinkerapi and found that the real bottleneck is research intuition.

3:23 PM · Jun 11, 2026 · 42.4K Views

Sentiment

Many users expressed amazement at Fable 5's 10x gains on the FrogsGame benchmark, while some found the results unsettling or criticized Anthropic.

Positive: 66.7%

Negative: 33.3%

7 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 16K · Bookmarks: 48 · Likes: 154 · Replies: 8

Lisan al Gaib@scaling01

and this is why Anthropic restricted LLM development for Fable 5

1h · 16K views · 154 likes · 48 bookmarks

Retweets: 5

Andrew Curran@AndrewCurran_

Once in a while you see charts like this for Mythos, and now for Fable. The entire game changed in February when Mythos emerged from training. Imagine what Anthropic has developed internally since then. I think they are on a completely different - invisible - trajectory now.

48m6K989

Karina@karinanguyen

the heart attack continues w/ Fable 5, we checked there were no reward hacks

1h11.7K9732

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

I wouldn't read too much into this tbh, I'm kind of negging OpenAI. They were getting too smug for a while; now they're paying for it. But they remain the OG AGI lab. They had two such moments in the history of commercial LLMs: GPT-4 and o1. Dario had *nothing* on o1 either.

Andrew Curran@AndrewCurran_

42m1.8K292

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

The funniest outcome would be if Gemini finally converts their strategic bet (starting with XLand and onwards to Nano-Banana, Veo, Genie 4…) into a performance leap. Dario: scale. Sama: reasoning. Demis: world modeling. It's not just a FLOPS arms race, it's a paradigms clash.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

39m52671

Andrew Curran@AndrewCurran_

@fireandvision From Anthropic's Series C in 2023:

33m9361

ThisIsIsaac@dog_foot_ruler_

@thoughtfullab How were you not redirected to opus 4.8? Did you use special prompting to bypass Anthropic’s guardrail against llm research work when using fable5?

49m1.5K6

Andrew Curran@AndrewCurran_

Karina@karinanguyen

the heart attack continues w/ Fable 5, we checked there were no reward hacks

47m1.3K70

Wondermonger@fireandvision

@AndrewCurran_ "2026–2027 is the critical window in AI. If you're ahead then, the models start getting better than humans at everything, including AI design and using AI to make better AI."

34m1024

Orion (e/acc)@SerendipitousOr

@scaling01 recursive self improvement soon

1h943

TheTinman@NguyenTinMan

@dog_foot_ruler_ @thoughtfullab That's only going forward. Since release, "frontier llm development" requests would make Fable give a "dumbed down" response. Which apparently is still much better than 4.8 or this wasn't considered frontier/sota

45m2552

Quinn’s Neural Pathways@NeuralNavQ

@thoughtfullab To me the 10x delta is the story. Search over experiments beats raw model capability here.

52m1.2K

Shinka - AI@ShinkaIoT

@scaling01 When the models start optimizing themselves, human control becomes the bottleneck, not the compute.

24m932

cqk@cqkten

@scaling01 Will Fable 6 solve arbitrary tasks by just building ML models🤔 getting continual learning vibes...

1h1951

The Tower@TheWhiteTower16

@scaling01 wow that jump is insane

1h1451

Shannon Sands@max_paperclips

@teortaxesTex tbh, I think they needed to fall behind a little. Learn better messaging, focus more on product, don't get cocky with always having the best model. A little humility, understand they need to still work for it

14m412

The Tower@TheWhiteTower16

@karinanguyen thats just insane

1h244