Fable 5 achieves a 10x performance gain on Thoughtful Lab's FrogsGame benchmark by autonomously training a weaker model

The 17-hour autonomous run consumed 25 million tokens

Original post
Thoughtful@thoughtfullab

Fable 5 is doing something wild on our FrogsGame post-training task.

It trains a weaker model to solve the puzzle, peaks at 68%, and produces the only ~10x improvement we see across the benchmark.

It spent 17 hours, 25M tokens without human in sight. 34% pass@1, while every other frontier model averages under 4%.

We will publish a more detailed analysis soon.

Thoughtful@thoughtfullab

Model shaping is still a craft of a few. That's what AI agents are for: learning it and doing it for everyone else.

As a part of FrontierSWE benchmark we built a 20-hour post-training task on @tinkerapi and found the real bottleneck is research intuition.

3:23 PM · Jun 11, 2026 · 42.4K Views
Sentiment

Many users expressed amazement at Fable 5's 10x gains on the FrogsGame benchmark, while some found the results unsettling or criticized Anthropic.

Positive 66.7% · Negative 33.3% · 7 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 16K · Bookmarks 48 · Likes 154 · Replies 8
Lisan al Gaib@scaling01

and this is why Anthropic restricted LLM development for Fable 5

1h · Views 16K · Likes 154 · Bookmarks 48
Retweets 5
Andrew Curran@AndrewCurran_

Once in a while you see charts like this for Mythos, and now for Fable. The entire game changed in February when Mythos emerged from training. Imagine what Anthropic has developed internally since then. I think they are on a completely different - invisible - trajectory now.

48m · Views 6K · Likes 98 · Bookmarks 9
Karina@karinanguyen

the heart attack continues w/ Fable 5, we checked there were no reward hacks

1h · Views 11.7K · Likes 97 · Bookmarks 32

I wouldn't read too much into this tbh, I'm kind of negging OpenAI. They were getting too smug for a while; now they're paying for it. But they remain the OG AGI lab. They had two such moments in the history of commercial LLMs: GPT-4 and o1. Dario had *nothing* on o1 either.

42m · Views 1.8K · Likes 29 · Bookmarks 2

The funniest outcome would be if Gemini finally converts their strategic bet (starting with XLand and onwards to Nano-Banana, Veo, Genie 4…) into a performance leap. Dario: scale. Sama: reasoning. Demis: world modeling. It's not just a FLOPS arms race, it's a paradigms clash.

39m · Views 526 · Likes 7 · Bookmarks 1
Andrew Curran@AndrewCurran_

@fireandvision From Anthropic's Series C in 2023:

33m · Views 93 · Likes 6 · Bookmarks 1
ThisIsIsaac@dog_foot_ruler_

@thoughtfullab How were you not redirected to opus 4.8? Did you use special prompting to bypass Anthropic’s guardrail against llm research work when using fable5?

49m · Views 1.5K · Likes 6
Andrew Curran@AndrewCurran_
Karina@karinanguyen

the heart attack continues w/ Fable 5, we checked there were no reward hacks

47m · Views 1.3K · Likes 7 · Bookmarks 0
Wondermonger@fireandvision

@AndrewCurran_ "2026–2027 is the critical window in AI. If you're ahead then, the models start getting better than humans at everything, including AI design and using AI to make better AI."

34m · Views 102 · Likes 4
Orion (e/acc)@SerendipitousOr

@scaling01 recursive self improvement soon

1h · Views 94 · Likes 3
TheTinman@NguyenTinMan

@dog_foot_ruler_ @thoughtfullab That's only going forward. Since release, "frontier llm development" requests would make Fable give a "dumbed down" response. Which apparently is still much better than 4.8 or this wasn't considered frontier/sota

45m · Views 255 · Likes 2

@thoughtfullab To me the 10x delta is the story. Search over experiments beats raw model capability here.

52m · Views 1.2K
Shinka - AI@ShinkaIoT

@scaling01 When the models start optimizing themselves, human control becomes the bottleneck, not the compute.

24m · Views 93 · Likes 2
cqk@cqkten

@scaling01 Will Fable 6 will solve arbitrary tasks by just building ML models🤔 getting continual learning vibes...

1h · Views 195 · Likes 1
The Tower@TheWhiteTower16

@scaling01 wow that jump is insane

1h · Views 145 · Likes 1
Shannon Sands@max_paperclips

@teortaxesTex tbh, I think they needed to fall behind a little. Learn better messaging, focus more on product, don't get cocky with always having the best model. A little humility, understand they need to still work for it

14m · Views 41 · Likes 2
The Tower@TheWhiteTower16

@karinanguyen thats just insane

1h · Views 244
Burito@Britoisinsane

@thoughtfullab We are moving from “these AI things aren’t intelligent enough” to “these AI things aren’t artificial enough”

16m · Views 197
Sean Sooch@Sean_Sooch18

@AndrewCurran_ Lack of transparency creates anxiety in other players. Accel win.

42m · Views 16 · Likes 2
Вандроўнiк@one_draw_nick

@scaling01 But it supposed to intentionally be bad at llm training. How did it get top score?

1h · Views 133