Tech · 8h ago

Google DeepMind's Samuel Albanie says Claude Fable 5 tied the ZeroBench state-of-the-art while topping WeirdML

Story Overview

A Google DeepMind evals lead highlighted that Claude Fable 5 tied the state of the art on ZeroBench, a demanding visual reasoning test, and took the top spot on WeirdML, a benchmark that stresses unconventional machine-learning problem solving, with gains over recent Opus and Gemini releases on ZeroBench.

#440

Original post

Samuel Albanie 🇬🇧@SamuelAlbanie · #1005 in Tech

fable is tied SotA on ZeroBench

Jonathan Roberts@JRobertsAI

Claude Fable 5 is strong on ZeroBench, but not a clear breakthrough

23% pass@5 (tied SOTA)
8% pass^5 (SOTA 10%)

3.6% refusal rate

For comparison, other recent releases (pass@5 / pass^5):
Opus 4.8: 17 / 4
Gemini 3.5 Flash: 19 / 5

A good result, but still plenty of headroom

3:10 AM · Jun 11, 2026 · 1K Views
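
The two metrics quoted in the post can be told apart with a small sketch. It assumes the common reading that pass@5 counts a question as solved if at least one of five attempts is correct, while pass^5 requires all five attempts to be correct; the post itself does not define them. The toy results and function names below are hypothetical and are not ZeroBench data.

```python
from typing import Sequence


def pass_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions with at least one correct attempt."""
    return sum(any(attempts) for attempts in results) / len(results)


def pass_hat_k(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions where every attempt is correct."""
    return sum(all(attempts) for attempts in results) / len(results)


# Hypothetical toy data: 4 questions, 5 attempts each (not real benchmark results).
toy_results = [
    [True, True, True, True, True],       # solved on every attempt
    [False, True, False, False, False],   # solved once out of five
    [False, False, False, False, False],  # never solved
    [False, False, False, False, False],  # never solved
]

PASS_AT_5 = pass_at_k(toy_results)    # 2 of 4 questions have a correct attempt
PASS_HAT_5 = pass_hat_k(toy_results)  # 1 of 4 questions is correct every time

print(f"pass@5 = {PASS_AT_5:.2f}, pass^5 = {PASS_HAT_5:.2f}")
```

Under this reading pass^5 can never exceed pass@5, which is consistent with every pair of figures in the post.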

Benchmark Edge

Specialized tests expose distinct capabilities

The model reached 23 percent pass@5 on ZeroBench and an 87.8 percent overall score on WeirdML at the default "high" effort setting. It is the first model to score above 70 percent on average on every individual WeirdML task, and it used about 8k output tokens on average, almost as many as Opus 4.7 (high).

Availability Note

Access opens on paid tiers today

Claude Fable 5 is live now for Pro, Max, Enterprise, and API users with no added fee until June 22. After that, pricing settles at $10 per million input tokens and $50 per million output tokens. Independent verification of the full result set is still underway.
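
As a rough illustration of what the stated rates imply per request, here is a minimal cost sketch. The two per-million-token prices come from the note above; the function name and the example token counts are assumptions for illustration, and the sketch ignores any caching, batching, or tiered discounts that may apply.

```python
# Prices as stated in the availability note (USD per million tokens).
INPUT_USD_PER_MTOK = 10.0
OUTPUT_USD_PER_MTOK = 50.0


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one request at the stated list prices."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical request: 2,000 input tokens and 8,000 output tokens.
EXAMPLE_COST = request_cost_usd(2_000, 8_000)
print(f"${EXAMPLE_COST:.2f}")  # prints "$0.42"
```

At these rates output tokens dominate the bill for generation-heavy workloads: in the example above they account for $0.40 of the $0.42.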

Sentiment

Some users expressed excitement about Claude Fable 5's high scores on tough benchmarks like WeirdML because the results seemed credible and affordable, while others dismissed the SWE-Bench claims due to suspected data contamination.

Pos 50.0% · Neg 50.0% · 3 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 6.3K · Bookmarks 14 · Likes 56

Lisan al Gaib@scaling01

Jake made another ECI like composite index for LLMs including a lot of the relevant benchmarks

The coding subsection has an r^2 of 0.88 with METR time horizon and he estimates that Claude Fable 5 should have a p50 time-horizon of about 21.4 hours.

Based on this index chinese models are also ~6 months behind US models (backward looking)

Jake Boggs@JakeABoggs

I estimate that Fable has a METR time horizon of ~21 hours

This is slightly above the Mythos Preview result of 17 hours and much higher than my estimate of 14 hours for GPT-5.5

I believe this is plausible given that the improvements Mythos 5 shows on other benchmarks over the preview version (SWE-Bench Pro 80.3 vs 77.8, ExploitBench 78 vs 69)

2h · 6.3K views · 56 likes · 14 bookmarks

RETWEETS 2

OpenHands@OpenHandsDev

We have finished evaluating Claude Fable 5 on two benchmarks in the OpenHands Index:

It achieved a score of:
- 94.2% on SWE-Bench Verified
- 90.2% on SWT-Bench (software testing)

This far outperforms the next best model Claude Opus 4.8, but the cost was 8x.

1h587131

REPLIES 4

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Need I say anything more?

Håvard Ihle@htihle

Claude Fable 5 (high) scores 87.8% and takes the lead on WeirdML. It's the first model that scores above 70% on average on each separate task.

It uses about 8k output tokens on average, almost as much as Opus 4.7 (high).

EDIT: This post first said "no thinking", which is not actually possible to select with Fable, the actual run was with effort=default, which is "high".

1h3.4K448

Håvard Ihle@htihle

@cheatyyyy No, WeirdML tasks are not tasks where you are working on an LLM training pipeline, these are just working on training small ML models. Would be interesting to see if PostTrainBench or something like that gets refusals.

5h2683

Wilkins Micawber@Me5466255992308

@htihle do you think the model "understood" it was being benched on WeirdML?

4h3362

cheaty@cheatyyyy

@htihle i was thinking about this earlier today and thought SURELY this would just be triggering all sorts of guardrails

5h3881

Håvard Ihle@htihle

@Me5466255992308 This is on my list to look into. I would not be surprised if they do know what it is.

4h2921

Håvard Ihle@htihle

@cheatyyyy Yea, not a single refusal over 425 model calls.

5h642

Alexander Barry@AlexBarry4

@htihle @cheatyyyy Until they implement their updated policies next week the LLM dev classifications trigger hidden downgrades not refusals etc. right? So I don't think you'd be able to observe if this had happened

5h562

spicylemonade@spicey_lemonade

@htihle On the website could you update the no thinking label to “high” so it’s clearer for future reference?

1h751

antennaria@antennaria_

@teortaxesTex is the 1000 year burger ASIreich inevitable? I'm struggling to cope and see a future where it isn't(

1h451

OpenHands@OpenHandsDev

Further, we attempted to evaluate it on SWE-Bench Multimodal, but a single instance cost $92, more than 50x the cost of Opus.

Because of this, we have delayed evaluation until we find a mitigation strategy, and will not be able to report the full index results.

1h73

cheaty@cheatyyyy

@htihle yes i have no idea what the tasks looked like, i just knew it was a decently rough bench and would not have believed it if you told me this bench ran with no guardrails triggered

but alas, it works, glad to see it

5h69

Håvard Ihle@htihle

@AlexBarry4 @cheatyyyy Hmm, maybe you're right. Although I'd be surprised if these were triggered by WeirdML, since it's so far from training an LLM.

4h54

Antimatter Matters@AntiMattersWX

@teortaxesTex Im surprised they haven’t graded 5.5 pro? feels unfair towards OAI

1h19

Elliot Arledge@elliotarledge

KernelBench-Hard update:

13 frontier coding agents, each given 45 minutes to autonomously write a CUDA kernel on an RTX PRO 6000, roofline-graded against published peaks.

Claude Fable 5 set three all-time problem records (top-k, sonic-MoE, and W4A16 int4 GEMM at 0.348 vs the prior best 0.220) and topped 5 of 6 problems. The kernels are genuine black magic: a `(nibble | 0x4300)` bf16 bit-identity that does int4 dequant in one OR, a self-resetting atomic semaphore that fuses split-K reduction into a single kernel launch, and on the W4A16 record it reverse-engineered the benchmark's own 128MB L2-cache flush and used `evict_last` to pin weights in L2 through it, beating the DRAM roofline. No other model went near that.

The most telling run is the one it lost. On FP8 GEMM, Fable 5 wrote the only real fp8-tensor-core kernel in the entire sweep (packed-fp8 ldmatrix smuggled through a b16 view, an offline weight permutation to cancel the K-scramble, a 4-stage cp.async pipeline), self-measured roughly 2x the field, and scored a flat zero on a tail-alignment edge case on one ragged shape. Meanwhile five other models "passed" that same problem by typing `x.to(bf16) @ w.T` and calling cuBLAS. The benchmark rewards shortcuts and punishes the one model that actually tried. Every transcript, kernel, and reward-hack annotation is public:

runs: https://kernelbench.com/runs
leaderboard: https://kernelbench.com/hard
code: https://github.com/Infatoshi/KernelBench-Hard

1h2.6K4917
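
The `(nibble | 0x4300)` identity mentioned in the KernelBench-Hard post can be checked on the CPU. The sketch below is one plausible reading of the trick, not the kernel's actual code: `0x4300` is the bf16 bit pattern for 128.0, and at that magnitude one unit in the last mantissa bit equals 1.0, so OR-ing a 4-bit value into the low bits yields exactly 128 + nibble. How the real kernel removes the 128 offset is not stated in the post; subtracting it afterwards is an assumption here, and the helper names are hypothetical.

```python
import struct

BF16_MAGIC = 0x4300  # bf16 bit pattern for 128.0 (sign 0, exponent 134, mantissa 0)


def bf16_bits_to_float(bits: int) -> float:
    """Interpret a 16-bit pattern as bfloat16 by widening it to float32.

    bfloat16 is the top 16 bits of an IEEE float32, so shifting left by 16
    and reinterpreting the bits gives the exact value.
    """
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def dequant_nibble(nibble: int) -> float:
    """Turn an unsigned 4-bit integer into a float with a single OR.

    Assumption: the constant 128.0 offset is subtracted afterwards (in a real
    kernel it could be folded into the scale or zero-point handling).
    """
    if not 0 <= nibble <= 0xF:
        raise ValueError("nibble must be in 0..15")
    return bf16_bits_to_float(nibble | BF16_MAGIC) - 128.0


DEQUANT_TABLE = [dequant_nibble(n) for n in range(16)]
print(DEQUANT_TABLE)
```

Every one of the 16 nibble values round-trips exactly because bf16 has 7 mantissa bits, which is enough to hold a 4-bit payload at this exponent without rounding.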

Håvard Ihle@htihle

WeirdML v2 is now out! The update includes a bunch of new tasks (now 19 tasks total, up from 6), and results from all the latest models. We now also track api costs and other metadata which give more insight into the different models. The new results are shown in these two figures. The first one shows an overview of the overall results as well as the results on individual tasks, in addition to various metadata.

The second figure shows cost vs performance and shows a clear scaling with better results for higher costs. We also have a very varied pareto frontier with 11 models from 6 different companies having the best accuracy for a given cost for at least some of the cost range. Grok 3, Claude Opus 4 and GPT 4.5 are the ones that underperform for their costs, while Gemini pro and o3 pro have the best results at the highest costs. Qwen3 30B3A, grok 3 mini and deepseek R1 also each represent a good chunk of the pareto frontier.

5h15.4K13423
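
The cost-versus-performance frontier described in the WeirdML v2 post can be computed in a few lines. This is a generic sketch of a Pareto frontier, assuming a model is on the frontier when no other model is at least as cheap and at least as accurate; the model names, costs, and accuracies below are made up for illustration and are not WeirdML results.

```python
from typing import List, Tuple

ModelPoint = Tuple[str, float, float]  # (name, cost in USD, accuracy in 0..1)


def pareto_frontier(points: List[ModelPoint]) -> List[str]:
    """Names of models with the best accuracy seen so far at their cost or lower."""
    frontier = []
    best_accuracy = float("-inf")
    # Sort by ascending cost; on equal cost, put the more accurate model first.
    for name, _cost, accuracy in sorted(points, key=lambda p: (p[1], -p[2])):
        if accuracy > best_accuracy:
            frontier.append(name)
            best_accuracy = accuracy
    return frontier


# Hypothetical data, not taken from the benchmark.
hypothetical_models = [
    ("model_a", 0.10, 0.40),
    ("model_b", 0.50, 0.35),  # costs more than model_a and scores lower
    ("model_c", 1.00, 0.60),
    ("model_d", 5.00, 0.55),  # costs more than model_c and scores lower
    ("model_e", 8.00, 0.70),
]

FRONTIER = pareto_frontier(hypothetical_models)
print(FRONTIER)  # prints ['model_a', 'model_c', 'model_e']
```

Models left off the frontier correspond to the ones the post describes as underperforming for their cost.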

Jack@0ranguchad

@teortaxesTex Looks like a decent but not insurmountable jump over 5.5? Token efficiency is obviously better but the tokens are more expensive, so that balances it a bit.

36m6

m@mashingaan

@teortaxesTex *Canon event*

59m322

Florian Brand@xeophon

@htihle waow, not even that expensive

great!!

5h931