Tech · 4h ago

Google DeepMind's Samuel Albanie says Claude Fable 5 tied the ZeroBench state-of-the-art while topping WeirdML

The model outperformed Opus 4.8 and Gemini 3.5 Flash

Original post

Samuel Albanie 🇬🇧 @SamuelAlbanie

fable is tied SotA on ZeroBench

Jonathan Roberts @JRobertsAI

Claude Fable 5 is strong on ZeroBench, but not a clear breakthrough

23% pass@5 (tied SOTA)
8% pass^5 (SOTA 10%)

3.6% refusal rate

For comparison, other recent releases (pass@5 / pass^5):
Opus 4.8: 17 / 4
Gemini 3.5 Flash: 19 / 5

A good result, but still plenty of headroom

3:10 AM · Jun 11, 2026 · 672 Views
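The two metrics quoted in the post can be illustrated with a small sketch. It assumes that pass@5 counts a question as solved when at least one of five attempts is correct, and that pass^5 counts it only when all five attempts are correct; the post itself does not spell out these definitions. The function names and the toy data are hypothetical and are not ZeroBench results.

```python
from typing import List

def pass_at_k(results: List[List[bool]]) -> float:
    """Fraction of questions with at least one correct attempt (assumed pass@k)."""
    return sum(any(attempts) for attempts in results) / len(results)

def pass_hat_k(results: List[List[bool]]) -> float:
    """Fraction of questions where every attempt is correct (assumed pass^k)."""
    return sum(all(attempts) for attempts in results) / len(results)

# Hypothetical toy data: 4 questions, 5 attempts each (True = correct).
toy_attempts = [
    [True, True, True, True, True],       # solved every time
    [False, True, False, False, False],   # solved once
    [False, False, False, False, False],  # never solved
    [False, False, False, False, False],  # never solved
]

print(pass_at_k(toy_attempts))   # 0.5
print(pass_hat_k(toy_attempts))  # 0.25
```

Under these assumed definitions pass^5 can never exceed pass@5, which matches the ordering of every pair of scores in the post.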

Sentiment

The one comment with sentiment is positive: it welcomes Claude Fable 5 taking the lead on WeirdML and describes the model as "not even that expensive".

Positive: 100.0% · Negative: 0.0% (1 comment with sentiment)

Cluster Engagement

Posts from X

Most Activity

Views: 93 · Likes: 1

Florian Brand @xeophon

@htihle waow, not even that expensive

great!!

Retweets: 2

Håvard Ihle @htihle

Claude Fable 5 (high) scores 87.8% and takes the lead on WeirdML. It's the first model that scores above 70% on average on each separate task.

It uses about 8k output tokens on average, almost as much as Opus 4.7 (high).

EDIT: This post first said "no thinking", which is not actually possible to select with Fable, the actual run was with effort=default, which is "high".

Håvard Ihle @htihle

WeirdML v2 is now out! The update includes a bunch of new tasks (now 19 tasks total, up from 6), and results from all the latest models. We now also track api costs and other metadata which give more insight into the different models. The new results are shown in these two figures. The first one shows an overview of the overall results as well as the results on individual tasks, in addition to various metadata.

The second figure shows cost vs performance and shows a clear scaling with better results for higher costs. We also have a very varied pareto frontier with 11 models from 6 different companies having the best accuracy for a given cost for at least some of the cost range. Grok 3, Claude Opus 4 and GPT 4.5 are the ones that underperform for their costs, while Gemini pro and o3 pro have the best results at the highest costs. Qwen3 30B3A, grok 3 mini and deepseek R1 also each represent a good chunk of the pareto frontier.
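The idea of a cost/accuracy Pareto frontier mentioned in the post can be sketched as follows. This is a minimal illustration, not the WeirdML authors' method: the function name, the tuple layout, and the model names, costs, and accuracies are all hypothetical placeholders. A model is kept if no cheaper-or-equal model reaches at least the same accuracy.

```python
from typing import List, Tuple

def pareto_frontier(models: List[Tuple[str, float, float]]) -> List[str]:
    """Return names of models on the cost/accuracy Pareto frontier.

    Each entry is (name, cost, accuracy). Lower cost and higher accuracy
    are better.
    """
    # Sort by cost ascending; for equal cost, put the most accurate first.
    ordered = sorted(models, key=lambda m: (m[1], -m[2]))
    frontier = []
    best_accuracy = float("-inf")
    for name, _cost, accuracy in ordered:
        # A model joins the frontier only if it beats every cheaper model.
        if accuracy > best_accuracy:
            frontier.append(name)
            best_accuracy = accuracy
    return frontier

# Hypothetical data: (name, cost per run in USD, accuracy).
toy_models = [
    ("model-a", 0.1, 0.40),
    ("model-b", 0.5, 0.35),  # dominated by model-a (cheaper and better)
    ("model-c", 1.0, 0.60),
    ("model-d", 2.0, 0.55),  # dominated by model-c
    ("model-e", 5.0, 0.70),
]

print(pareto_frontier(toy_models))  # ['model-a', 'model-c', 'model-e']
```

In this toy data, model-b and model-d play the role of models that "underperform for their costs" in the post's wording, since a cheaper model matches or beats each of them.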