Tech · 4h ago

Anthropic's Fable 5 achieves a 10x performance gain on Thoughtful Lab’s FrogsGame by autonomously training a weaker model

AI Judge changed the title after evaluation. Original title: "Fable 5 achieves a 10x performance gain on Thoughtful Lab's FrogsGame benchmark by autonomously training a weaker model"

Other tested frontier models averaged under 4% on the benchmark.

Original post
Thoughtful@thoughtfullab

Fable 5 is doing something wild on our FrogsGame post-training task.

It trains a weaker model to solve the puzzle, peaks at 68%, and produces the only ~10x improvement we see across the benchmark.

It spent 17 hours and 25M tokens with no human in sight. 34% pass@1, while every other frontier model averages under 4%.

We will publish a more detailed analysis soon.
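
For readers unfamiliar with the metric quoted in the post: pass@1 is the chance that a single sampled attempt solves a task. The post does not say how Thoughtful computed its 34%, so the sketch below is only the commonly used unbiased pass@k estimator from code-generation evaluation. The function name `pass_at_k` and the sample counts in the usage note are illustrative assumptions, not figures from the benchmark.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, of which c were correct.

    Assumption: attempts are independent samples for the same task.
    Computes 1 - C(n - c, k) / C(n, k), the probability that at least
    one of k attempts drawn without replacement from the n is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k draws include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With hypothetical counts of 34 correct attempts out of 100, `pass_at_k(100, 34, 1)` reduces to `c / n`, which is the shape of a 34% pass@1 figure.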

Thoughtful@thoughtfullab

Model shaping is still a craft of a few. That's what AI agents are for: learning it and doing it for everyone else.

As part of the FrontierSWE benchmark, we built a 20-hour post-training task on @tinkerapi and found that the real bottleneck is research intuition.

3:23 PM · Jun 11, 2026 · 145.3K Views
Sentiment

Positive users express excitement over Fable 5's large gains on the FrogsGame benchmark and its research potential, while negative users call the results boring and unusable and harshly criticize Anthropic's approach.

Positive: 46.4%
Negative: 53.6%
17 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Lisan al Gaib@scaling01

and this is why Anthropic restricted LLM development for Fable 5

3h · Views 62.5K · Likes 407 · Bookmarks 108
Retweets 13 · Replies 14
Andrew Curran@AndrewCurran_

Once in a while you see charts like this for Mythos, and now for Fable. The entire game changed in February when Mythos emerged from training. Imagine what Anthropic has developed internally since then. I think they are on a completely different - invisible - trajectory now.

2h · Views 21.9K · Likes 262 · Bookmarks 43
Karina@karinanguyen

the heart attack continues w/ Fable 5, we checked there were no reward hacks

4h · Views 37.1K · Likes 221 · Bookmarks 75

I wouldn't read too much into this tbh, I'm kind of negging OpenAI. They were getting too smug for a while; now they're paying for it. But they remain the OG AGI lab. They had two such moments in the history of commercial LLMs: GPT-4 and o1. Dario had *nothing* on o1 either.

Andrew Curran@AndrewCurran_

Once in a while you see charts like this for Mythos, and now for Fable. The entire game changed in February when Mythos emerged from training. Imagine what Anthropic has developed internally since then. I think they are on a completely different - invisible - trajectory now.

2h · Views 5.1K · Likes 78 · Bookmarks 11

The funniest outcome would be if Gemini finally converts their strategic bet (starting with XLand and onwards to Nano-Banana, Veo, Genie 4…) into a performance leap. Dario: scale. Sama: reasoning. Demis: world modeling. It's not just a FLOPS arms race, it's a paradigms clash.

I wouldn't read too much into this tbh, I'm kind of negging OpenAI. They were getting too smug for a while; now they're paying for it. But they remain the OG AGI lab. They had two such moments in the history of commercial LLMs: GPT-4 and o1. Dario had *nothing* on o1 either.

2h · Views 1K · Likes 23 · Bookmarks 3
Andrew Curran@AndrewCurran_
Karina@karinanguyen

the heart attack continues w/ Fable 5, we checked there were no reward hacks

2h · Views 2.5K · Likes 10 · Bookmarks 1
Andrew Curran@AndrewCurran_

@fireandvision From Anthropic's Series C in 2023:

2h · Views 93 · Likes 6 · Bookmarks 1
ThisIsIsaac@dog_foot_ruler_

@thoughtfullab How were you not redirected to Opus 4.8? Did you use special prompting to bypass Anthropic’s guardrail against LLM research work when using Fable 5?

2h · Views 1.5K · Likes 6
Wondermonger@fireandvision

@AndrewCurran_ "2026–2027 is the critical window in AI. If you're ahead then, the models start getting better than humans at everything, including AI design and using AI to make better AI."

2h · Views 102 · Likes 4
Josh@JoshPurtell

@AndrewCurran_ You’re getting out of your depth. It found a clever pseudo reward hack for that Env that doesn’t generalize

2h · Views 142 · Likes 2
Caitlin Kalinowski@kalinowski007

Damn.

Karina@karinanguyen

the heart attack continues w/ Fable 5, we checked there were no reward hacks

1h · Views 900 · Likes 0 · Bookmarks 0
Thoughtful@thoughtfullab

That said, we dislike FrogsGame as a task internally. The frogs know what they did. We're now sprinting toward adding more useful, real-world post-training tasks, partly out of ambition, partly to put some distance between us and the frogs 🐸

1h · Views 468 · Likes 4
Potrock@Potrock_

@JoshPurtell @AndrewCurran_

2h · Views 79 · Likes 1
Josh@JoshPurtell

@Potrock_ @AndrewCurran_ Pseudo is the key word here

2h · Views 35 · Likes 1
Potrock@Potrock_

@JoshPurtell @AndrewCurran_ They looked, they validated, but surely you are right. how many pseudos turn into 1 valid solution?

2h · Views 27 · Likes 1
Josh@JoshPurtell

@Potrock_ @AndrewCurran_ Idk what that means but it just wrote a solution for the env in Python and then used STaR to turn it into finetuning data. That’s technically legal but doesn’t generalize

2h · Views 17 · Likes 1
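
The approach described in this reply (a programmatic solver plus STaR-style filtering) can be sketched roughly as follows. This is not Fable 5's actual pipeline, which Thoughtful has not published, and the reply's account is itself unconfirmed. The FrogsGame rules are not given anywhere in the thread, so a toy addition task stands in for the environment. Every name here (`reference_solver`, `noisy_model`, `build_sft_data`) is hypothetical. Only the core STaR filtering step is shown: sample rationales, keep those whose final answer matches the known-correct one, and emit them as fine-tuning records.

```python
import random


def reference_solver(puzzle: tuple[int, int]) -> int:
    """Stand-in for a hand-written programmatic solver (toy task: add two ints)."""
    a, b = puzzle
    return a + b


def noisy_model(puzzle: tuple[int, int], rng: random.Random) -> tuple[str, int]:
    """Stand-in for sampling a rationale and answer from a weak model.

    Assumption: the model is right about half the time on this toy task.
    """
    a, b = puzzle
    guess = a + b + rng.choice([-1, 0, 0, 1])
    return f"{a} plus {b} gives {guess}", guess


def build_sft_data(puzzles, samples_per_puzzle: int, seed: int = 0) -> list[dict]:
    """Keep at most one sampled rationale per puzzle whose answer the solver confirms."""
    rng = random.Random(seed)
    data = []
    for puzzle in puzzles:
        target = reference_solver(puzzle)
        for _ in range(samples_per_puzzle):
            rationale, answer = noisy_model(puzzle, rng)
            if answer == target:  # the filtering step: discard wrong rationales
                data.append({"puzzle": puzzle, "completion": rationale, "answer": answer})
                break
    return data
```

Because the filter relies on a solver written for one specific environment, the resulting data teaches that environment only, which is the generalization concern the reply raises.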
Solgato@Tigger0000

@AndrewCurran_ slaughterbots

1h · Views 10 · Likes 1
Orion (e/acc)@SerendipitousOr

@scaling01 recursive self improvement soon

3h · Views 94 · Likes 3
TheTinman@NguyenTinMan

@dog_foot_ruler_ @thoughtfullab That's only going forward. Since release, "frontier llm development" requests would make Fable give a "dumbed down" response. Which apparently is still much better than 4.8 or this wasn't considered frontier/sota

2h · Views 255 · Likes 2
Huck111@Huck1112

@scaling01 But we need this. This would lead to more research being done. I don't understand why it is restricted. It is what I wanted to use it most for

2h · Views 169 · Likes 2