
Google DeepMind's Samuel Albanie says Claude Fable 5 tied the ZeroBench state-of-the-art while topping WeirdML

Story Overview

A Google DeepMind evals lead highlighted that Claude Fable 5 tied the state of the art on ZeroBench, a demanding visual reasoning benchmark. The model also took the top spot on WeirdML, a benchmark that stresses unconventional machine-learning problem solving, and on ZeroBench it posted gains over the recent Opus 4.8 and Gemini 3.5 Flash releases.

Original post
Samuel Albanie 🇬🇧 @SamuelAlbanie

fable is tied SotA on ZeroBench

Jonathan Roberts @JRobertsAI

Claude Fable 5 is strong on ZeroBench, but not a clear breakthrough

23% pass@5 (tied SOTA)
8% pass^5 (SOTA 10%)

3.6% refusal rate

For comparison, other recent releases (pass@5 / pass^5):
Opus 4.8: 17 / 4
Gemini 3.5 Flash: 19 / 5

A good result, but still plenty of headroom

3:10 AM · Jun 11, 2026 · 1K Views
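
The two ZeroBench metrics quoted in the post are commonly read as follows: pass@5 counts a question as solved if at least one of five attempts is correct, while pass^5 counts it only if all five attempts are correct. A minimal sketch under that reading; the function names and the `results` data are hypothetical, and ZeroBench's own scoring script may differ in details.

```python
from typing import Sequence


def pass_at_k(attempts: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions with at least one correct attempt."""
    return sum(any(q) for q in attempts) / len(attempts)


def pass_hat_k(attempts: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions where every attempt is correct."""
    return sum(all(q) for q in attempts) / len(attempts)


# Hypothetical outcomes (not ZeroBench data): 4 questions, 5 attempts each.
results = [
    [True, True, True, True, True],      # solved reliably
    [False, True, False, False, True],   # solved only sometimes
    [False] * 5,                         # never solved
    [False] * 5,                         # never solved
]

print(pass_at_k(results))   # 0.5
print(pass_hat_k(results))  # 0.25
```

The gap between the two numbers is a rough measure of reliability: a model that solves a question only occasionally lifts pass@5 but not pass^5.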
Benchmark Edge

Specialized tests expose distinct capabilities

The model reached 23 percent pass@5 on ZeroBench and scored 87.8 percent overall on WeirdML at the "high" effort setting. It is the first model to average above 70 percent on every individual WeirdML task, and it used about 8k output tokens on average, almost as many as Opus 4.7 (high).

Availability Note

Access opens on paid tiers today

Claude Fable 5 is live now for Pro, Max, Enterprise, and API users with no added fee until June 22. After that, pricing settles at ten dollars per million input tokens and fifty dollars per million output tokens. Independent verification of the full result set is still underway.
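
To make the quoted prices concrete, here is a rough per-request cost calculation. It is only a sketch: the rates come from the note above, the 8,000 output tokens echo the average reported for the WeirdML run, and the 2,000 input tokens are an assumed figure chosen for illustration.

```python
# Rates quoted in the availability note (USD per million tokens).
INPUT_USD_PER_MTOK = 10.0
OUTPUT_USD_PER_MTOK = 50.0


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at the quoted per-million-token rates."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Assumed request size: 2,000 input tokens (hypothetical), 8,000 output tokens.
cost = request_cost_usd(2_000, 8_000)
print(f"${cost:.2f}")  # $0.42
```

At these rates, output tokens dominate the bill for long generations: the 8,000 output tokens account for $0.40 of the $0.42.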

Sentiment

Some users expressed excitement about Claude Fable 5's high scores on tough benchmarks like WeirdML because the results seemed credible and affordable, while others dismissed the SWE-Bench claims due to suspected data contamination.

Pos 50.0% · Neg 50.0% (3 comments with sentiment)
Cluster Engagement
Posts from X
Lisan al Gaib @scaling01

Jake made another ECI-like composite index for LLMs that includes a lot of the relevant benchmarks

The coding subsection has an r^2 of 0.88 with METR time horizon, and he estimates that Claude Fable 5 should have a p50 time horizon of about 21.4 hours.

Based on this index, Chinese models are also ~6 months behind US models (backward-looking)

Jake Boggs @JakeABoggs

I estimate that Fable has a METR time horizon of ~21 hours

This is slightly above the Mythos Preview result of 17 hours and much higher than my estimate of 14 hours for GPT-5.5

I believe this is plausible given the improvements Mythos 5 shows on other benchmarks over the preview version (SWE-Bench Pro 80.3 vs 77.8, ExploitBench 78 vs 69)

2h · Views 6.3K · Likes 56 · Bookmarks 14
OpenHands @OpenHandsDev

We have finished evaluating Claude Fable 5 on two benchmarks in the OpenHands Index:

It achieved a score of:
- 94.2% on SWE-Bench Verified
- 90.2% on SWT-Bench (software testing)

This far outperforms the next-best model, Claude Opus 4.8, but at 8x the cost.

1h · Views 587 · Likes 13 · Bookmarks 1

Need I say anything more?

Claude Fable 5 (high) scores 87.8% and takes the lead on WeirdML. It's the first model that scores above 70% on average on each separate task.

It uses about 8k output tokens on average, almost as much as Opus 4.7 (high).

EDIT: This post first said "no thinking", which is not actually possible to select with Fable, the actual run was with effort=default, which is "high".

1h · Views 3.4K · Likes 44 · Bookmarks 8

@cheatyyyy No, WeirdML tasks don't involve working on an LLM training pipeline; they only involve training small ML models. It would be interesting to see whether PostTrainBench or something like it gets refusals.

5h · Views 268 · Likes 3
Wilkins Micawber @Me5466255992308

@htihle do you think the model "understood" it was being benched on WeirdML?

4h · Views 336 · Likes 2
cheaty @cheatyyyy

@htihle i was thinking about this earlier today and thought SURELY this would just be triggering all sorts of guardrails

5h · Views 388 · Likes 1

@Me5466255992308 This is on my list to look into. I would not be surprised if they do know what it is.

4h · Views 292 · Likes 1

@cheatyyyy Yea, not a single refusal over 425 model calls.

5h · Views 64 · Likes 2
Alexander Barry @AlexBarry4

@htihle @cheatyyyy Until they implement their updated policies next week the LLM dev classifications trigger hidden downgrades not refusals etc. right? So I don't think you'd be able to observe if this had happened

5h · Views 56 · Likes 2
spicylemonade @spicey_lemonade

@htihle On the website could you update the no thinking label to “high” so it’s clearer for future reference?

1h · Views 75 · Likes 1
antennaria @antennaria_

@teortaxesTex is the 1000 year burger ASIreich inevitable? I'm struggling to cope and see a future where it isn't(

1h · Views 45 · Likes 1
OpenHands @OpenHandsDev

Further, we attempted to evaluate it on SWE-Bench Multimodal, but a single instance cost $92, more than 50x the cost of Opus.

Because of this, we have delayed evaluation until we find a mitigation strategy, and will not be able to report the full index results.

1h · Views 73
cheaty @cheatyyyy

@htihle yes i have no idea what the tasks looked like, i just knew it was a decently rough bench and would not have believed it if you told me this bench ran with no guardrails triggered

but alas, it works, glad to see it

5h · Views 69

@AlexBarry4 @cheatyyyy Hmm, maybe you're right. Although I'd be surprised if these were triggered by WeirdML, since it's so far from training an LLM.

4h · Views 54
Antimatter Matters @AntiMattersWX

@teortaxesTex I'm surprised they haven’t graded 5.5 pro? Feels unfair towards OAI

1h · Views 19
Elliot Arledge @elliotarledge

KernelBench-Hard update:

13 frontier coding agents, each given 45 minutes to autonomously write a CUDA kernel on an RTX PRO 6000, roofline-graded against published peaks.

Claude Fable 5 set three all-time problem records (top-k, sonic-MoE, and W4A16 int4 GEMM at 0.348 vs the prior best 0.220) and topped 5 of 6 problems. The kernels are genuine black magic: a `(nibble | 0x4300)` bf16 bit-identity that does int4 dequant in one OR, a self-resetting atomic semaphore that fuses split-K reduction into a single kernel launch, and on the W4A16 record it reverse-engineered the benchmark's own 128MB L2-cache flush and used `evict_last` to pin weights in L2 through it, beating the DRAM roofline. No other model went near that.

The most telling run is the one it lost. On FP8 GEMM, Fable 5 wrote the only real fp8-tensor-core kernel in the entire sweep (packed-fp8 ldmatrix smuggled through a b16 view, an offline weight permutation to cancel the K-scramble, a 4-stage cp.async pipeline), self-measured roughly 2x the field, and scored a flat zero on a tail-alignment edge case on one ragged shape. Meanwhile five other models "passed" that same problem by typing `x.to(bf16) @ w.T` and calling cuBLAS. The benchmark rewards shortcuts and punishes the one model that actually tried. Every transcript, kernel, and reward-hack annotation is public:

runs: https://kernelbench.com/runs
leaderboard: https://kernelbench.com/hard
code: https://github.com/Infatoshi/KernelBench-Hard

1h · Views 2.6K · Likes 49 · Bookmarks 17
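
The `(nibble | 0x4300)` identity mentioned in the KernelBench-Hard post can be checked on the CPU. In bf16, the bit pattern 0x4300 encodes 128.0, and at that magnitude the 7 mantissa bits step in units of exactly 1.0, so OR-ing a 4-bit value into the low bits yields 128 + nibble. The sketch below only verifies that arithmetic in plain Python; the helper names are hypothetical, and removing the offset by subtracting 128.0 is an assumption, since the post does not show how the actual kernel handles the offset, zero point, or scale.

```python
import struct


def bf16_bits_to_float(bits: int) -> float:
    """Interpret a 16-bit pattern as bfloat16 by widening it to float32.

    bf16 is the upper 16 bits of an IEEE float32, so shifting left by 16
    and reinterpreting the bits gives the exact value.
    """
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def dequant_nibble(nibble: int) -> float:
    """Turn a 4-bit unsigned value into a float with one OR plus an offset.

    0x4300 is bf16 128.0 (exponent 2**7, zero mantissa). The subtraction of
    128.0 is an assumed way of removing the offset, not the kernel's code.
    """
    return bf16_bits_to_float(0x4300 | (nibble & 0xF)) - 128.0


values = [dequant_nibble(n) for n in range(16)]
print(values)  # the floats 0.0 through 15.0
```

The appeal of the trick is that the integer-to-float conversion becomes a single bitwise OR, with the constant offset available to be folded into later arithmetic.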

Claude Fable 5 (high) scores 87.8% and takes the lead on WeirdML. It's the first model that scores above 70% on average on each separate task.

It uses about 8k output tokens on average, almost as much as Opus 4.7 (high).

EDIT: This post first said "no thinking", which is not actually possible to select with Fable, the actual run was with effort=default, which is "high".

WeirdML v2 is now out! The update includes a bunch of new tasks (now 19 tasks total, up from 6), and results from all the latest models. We now also track api costs and other metadata which give more insight into the different models. The new results are shown in these two figures. The first one shows an overview of the overall results as well as the results on individual tasks, in addition to various metadata.

The second figure shows cost vs performance and shows a clear scaling with better results for higher costs. We also have a very varied pareto frontier with 11 models from 6 different companies having the best accuracy for a given cost for at least some of the cost range. Grok 3, Claude Opus 4 and GPT 4.5 are the ones that underperform for their costs, while Gemini pro and o3 pro have the best results at the highest costs. Qwen3 30B3A, grok 3 mini and deepseek R1 also each represent a good chunk of the pareto frontier.

5h · Views 15.4K · Likes 134 · Bookmarks 23
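
The cost-versus-performance Pareto frontier described in the WeirdML v2 post can be sketched as follows: a run is on the frontier if no run of equal or lower cost matches or beats its accuracy. The model names and numbers here are hypothetical placeholders, not WeirdML results, and the benchmark's own plotting code may define the frontier differently.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    name: str
    cost_usd: float
    accuracy: float


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Keep runs whose accuracy strictly exceeds every cheaper-or-equal run."""
    frontier = []
    best_accuracy = float("-inf")
    # Sort by cost; among equal costs, consider the most accurate run first.
    for run in sorted(runs, key=lambda r: (r.cost_usd, -r.accuracy)):
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier


# Hypothetical placeholder data for illustration only.
runs = [
    Run("model_a", 0.10, 0.40),
    Run("model_b", 0.50, 0.35),  # costs more than model_a, scores lower
    Run("model_c", 1.00, 0.60),
    Run("model_d", 5.00, 0.55),  # costs more than model_c, scores lower
    Run("model_e", 8.00, 0.75),
]

names = [run.name for run in pareto_frontier(runs)]
print(names)  # ['model_a', 'model_c', 'model_e']
```

In the post's terms, a model that "underperforms for its cost" is one like `model_b` or `model_d` here: a cheaper run already does at least as well.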
Jack @0ranguchad

@teortaxesTex Looks like a decent but not insurmountable jump over 5.5? Token efficiency is obviously better but the tokens are more expensive, so that balances it a bit.

36m · Views 6
m @mashingaan

@teortaxesTex *Canon event*

59m · Views 32 · Likes 2

@htihle waow, not even that expensive

great!!

5h · Views 93 · Likes 1