Tech · 2d ago

Anthropic's Claude Fable 5 takes first place on the Agent Arena leaderboard with an 18.23% net improvement in confirmed task success

It also achieved an 87.8% score on the WeirdML benchmark.

Original post

François Fleuret @francoisfleuret · #577 in Tech

No slowing down in sight, this is so weird.

Claude@claudeai

Fable 5 is state-of-the-art on nearly all tested benchmarks, with exceptional performance in software engineering, knowledge work, scientific research, and vision.

The longer and more complex the task, the larger Fable 5’s lead over our other models.

12:26 AM · Jun 10, 2026 · 16.9K Views

Sentiment

Positive users praise Claude Fable 5 for topping agent leaderboards and computer-use benchmarks with strong performance at lower cost, while negative users call the gains overhyped or poor value compared to prior models.

Pos

63.1%

Neg

36.9%

48 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 67.7K · Bookmarks: 177 · Retweets: 37

Aaron Levie@levie

Lots of evidence of huge jumps in capability for Fable across coding (and related) tasks. It’s also a major jump in accuracy and success in complex knowledge work tasks.

In our Box AI Complex Work Eval, we tested the model against Opus 4.8 and saw huge boosts across almost every industry. For our eval we give the Box AI Agent, using Fable, a set of hard real world knowledge work problems that deal with enterprise documents. Then score how the agent performs the tasks.

The main differentiators for Fable vs Opus 4.8 are that it doesn't take shortcuts on complex reasoning, it gets multi-step calculations right, and it's significantly more consistent across runs. We saw the biggest leaps in Media & Entertainment (78% vs 61%), Technology (81% vs 73%), Financial Services (89% vs 83%), and Healthcare (66% vs 60%).

Here are some specific examples:

* Legal M&A due diligence: On a task reviewing NDA terms against a semiconductor company's contracting policy, Fable correctly identified that a joint-ownership clause violates exclusivity requirements while a liability cap is permitted under a Super Cap exception. Fable scored 100% vs Opus's 78%.

* Healthcare: On a clinical radiology error audit across 12 reports, Fable precisely categorized each error by severity grade and correctly concluded no Grade 3 errors existed. Opus prematurely escalated a case to "major error requiring immediate departmental review" when the evidence didn't support it — Fable 63% vs Opus 41%.

* Media & Entertainment: On a genre profitability projection task, Fable correctly recognized that a 20% Argentine tax deduction was already embedded in the source spreadsheet figures and didn't double-apply it. Opus applied it again on top — a compounding error across 4 genre calculations that took its score negative on the task vs Fable's 74%.

* Retail analytics: On a task analyzing high-growth product articles against an investment benchmark, Fable correctly computed each article's growth rate individually and identified that only 2 of 5 exceeded the threshold. Opus confused "high growth relative to average" with "above the benchmark" — scoring 61% vs Fable's 94%.

* Financial Services: On a 5-year debt facility projection, Fable correctly applied interest to opening balances and used the right capex figure. Opus applied interest to the total facility amount and computed tax from the wrong base — two compounding errors. Fable scored 83% vs Opus's 62%.

* Technology: On a SaaS feature valuation requiring computation of a Feature Value Index across multiple regions, Fable applied the formula correctly and got exact values for the markets. Opus got the arithmetic wrong on multiple criteria — Fable scored 100% vs Opus's 74%.

Overall, huge step change in complex analysis, work that requires analytical reasoning, and deep domain understanding. Fable will be available shortly in the Box AI Studio for customers to build agents with.

Claude@claudeai

Fable 5 is state-of-the-art on nearly all tested benchmarks, with exceptional performance in software engineering, knowledge work, scientific research, and vision.

The longer and more complex the task, the larger Fable 5’s lead over our other models.

1d · 67.7K views · 352 likes · 177 bookmarks
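The Financial Services example in the post above contrasts charging interest on each year's opening balance with charging it on the total facility amount. The following is a minimal sketch of that difference, not the eval's actual task: the facility size, drawdown schedule, and rate are hypothetical figures chosen for illustration, and the function names are made up here. It also assumes interest is not capitalized and nothing is repaid.

```python
def interest_on_opening_balances(opening_balance, annual_drawdown, rate, years):
    """Charge each year's interest on that year's opening balance."""
    schedule = []
    balance = opening_balance
    for _ in range(years):
        schedule.append(balance * rate)  # interest only on what is drawn so far
        balance += annual_drawdown       # becomes next year's opening balance
    return schedule


def interest_on_total_facility(facility_limit, rate, years):
    """The mistaken approach: charge interest on the full facility every year."""
    return [facility_limit * rate for _ in range(years)]


# Hypothetical figures (in millions): a 100 facility drawn 20 per year at 5%.
correct = interest_on_opening_balances(20.0, 20.0, 0.05, 5)
mistaken = interest_on_total_facility(100.0, 0.05, 5)

print(f"opening-balance basis: {sum(correct):.1f}")
print(f"total-facility basis:  {sum(mistaken):.1f}")
```

Under these assumed numbers the total-facility basis overstates five-year interest, and any tax figure computed from that interest would inherit the error.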

Likes: 675 · Replies: 44

Lisan al Gaib@scaling01

Fable 5 is also by far the best computer use model according to Stagehand Agent Evals

it also costs less than half of GPT-5.5

1d · 48K views · 675 likes · 112 bookmarks

Lisan al Gaib@scaling01

insane jump in confirmed successes and praises by users

Arena.ai@arena

Exciting news: Claude Fable 5 ranks #1 on the new Agent Arena leaderboard!

Fable 5 leads by the widest margin ever over Opus-4.8 and GPT-5.5 on two key signals: confirmed task success rate and praise vs. complaint, despite weaker steerability. If Fable can do something, it will do it very well. If it can't/doesn't want to do something, it may be hard to steer the model towards the goal.

In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks. Models get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.

We use the causal tracing methodology to measure a model's net improvement which indicates how much it improves outcomes relative to the average model.

Huge congrats to @AnthropicAI for the incredible milestone! Below we break down how Claude Fable 5 (based on Mythos) scored across 5 signals, drawn from tasks submitted by a global community of users.

1d · 25.2K views · 396 likes · 64 bookmarks

Lisan al Gaib@scaling01

Claude Fable 5 is of course well ahead of the previous token-efficiency frontier

if you love big models clap your hands 👏 (scaling works)

9h · 13.6K views · 299 likes · 37 bookmarks

Lisan al Gaib@scaling01

Fable 5 is almost ranked 1st everywhere on Vals AI

1d · 16.1K views · 338 likes · 33 bookmarks

Florian Brand@xeophon

> We tested Fable 5 using Anthropic’s new ‘fallback’ mechanism, which can route safety-flagged messages to Claude Opus 4.8.

🤦‍♂️🤦‍♂️🤦‍♂️🤦‍♂️🤦‍♂️🤦‍♂️🤦‍♂️

And it affected 8% of the samples, so its score needs to be revised down, possibly quite a bit, bringing it closer to the rest of the pack.

Artificial Analysis@ArtificialAnlys

Claude Fable 5 launched today at #1 on the Artificial Analysis Intelligence Index, putting Anthropic nearly 5 points ahead of any other lab’s best model

We supported @AnthropicAI with pre-release evaluation of Claude Fable 5. Claude Fable 5 scores 64.9 on the Artificial Analysis Intelligence Index, claiming the #1 rank overall. It is ~5 points ahead of the closest non-Anthropic model (GPT-5.5), and Anthropic models now occupy both of the top 2 places.

Key takeaways for Claude Fable 5 (adaptive reasoning with max effort and Opus 4.8 as fallback model):

➤ New safety guardrails for Mythos-class models: Claude Fable 5 uses the same underlying model as Claude Mythos 5 for public usage, with additional guardrails for potentially-harmful cybersecurity, biology, chemistry, and distillation-related queries. We tested Fable 5 using Anthropic’s new ‘fallback’ mechanism, which can route safety-flagged messages to Claude Opus 4.8. Anthropic states that fallback occurs in fewer than 5% of sessions on average, and we recorded fallback routing in ~8% of tasks across the Intelligence Index (mostly in scientific questions from evaluations like GPQA, AA-Omniscience and Humanity’s Last Exam)

➤ State-of-the-art Intelligence: Claude Fable 5 takes the #1 position on the Artificial Analysis Intelligence Index, scoring 64.9 and setting the highest score on 5 of the 10 underlying benchmarks. On AA-Omniscience, our knowledge and hallucination benchmark, Fable 5 scores 40, +7 points over the previous leader, Gemini 3.1 Pro Preview, driven primarily by higher accuracy. We generally observe a strong relationship between AA-Omniscience accuracy and model size in open weights models, which suggests Fable 5 could be larger than previous public Anthropic models

➤ Frontier agentic capability: Claude Fable 5 is at the frontier across all three agentic evaluations in the Index: GDPval-AA (real-world work tasks), Terminal-Bench Hard (agentic coding), and Tau2-bench Telecom (tool use for customer service). Its GDPval-AA Elo of 1932 is a significant jump from the previous leader, Claude Opus 4.8, further extending Anthropic’s lead in agentic capabilities

➤ Leading HLE score, but refusal and fallback in 9% of tasks: Claude Fable 5 scores 53% on Humanity’s Last Exam, more than 7 points ahead of the next-best model, Claude Opus 4.8 (max). Fable 5 triggers safety guardrails on 9% of HLE tasks, falling back to Claude Opus 4.8. Including this fallback usage, running HLE with Fable 5 costs ~$2.2k, the highest of any model we have evaluated

Key model details:

➤ Context window: Claude Fable 5 retains the same 1M token context window as Claude Opus 4.8

➤ Price: Claude Fable 5 is priced at $10/$50 per 1M input/output tokens, 2x the token price of Claude Opus 4.8. The cache write/read price is $12.50/$1 per million tokens

➤ Availability: Claude Fable 5 is included in Pro, Max, Team, and seat-based Enterprise plans through June 22, consuming 2x Opus usage. From June 23, usage will require credits, with Anthropic saying it plans to restore subscription access once capacity allows

1d · 24.4K views · 201 likes · 24 bookmarks
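The token prices quoted in the Artificial Analysis post above can be turned into a rough per-request cost estimate. This is a minimal sketch that assumes simple linear per-token billing at the quoted rates; the `estimate_cost` helper and the request sizes are hypothetical and not part of any Anthropic SDK.

```python
# USD per 1M tokens, as quoted in the Artificial Analysis post.
PRICE_PER_MILLION = {
    "input": 10.00,
    "output": 50.00,
    "cache_write": 12.50,
    "cache_read": 1.00,
}


def estimate_cost(tokens):
    """Return the estimated USD cost for a dict of token counts by category."""
    unknown = set(tokens) - set(PRICE_PER_MILLION)
    if unknown:
        raise ValueError(f"unknown token categories: {sorted(unknown)}")
    return sum(
        PRICE_PER_MILLION[kind] * count / 1_000_000
        for kind, count in tokens.items()
    )


# Hypothetical request: 200K input tokens and 8K output tokens.
cost = estimate_cost({"input": 200_000, "output": 8_000})
print(f"${cost:.2f}")  # prints $2.40
```

The estimate ignores fallback routing to Claude Opus 4.8, which the same post says is billed as part of a run and would change the total.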

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Need I say anything more?

Håvard Ihle@htihle

Claude Fable 5 (high) scores 87.8% and takes the lead on WeirdML. It's the first model that scores above 70% on average on each separate task.

It uses about 8k output tokens on average, almost as much as Opus 4.7 (high).

EDIT: This post first said "no thinking", which is not actually possible to select with Fable. The actual run was with effort=default, which is "high".

15h · 15.8K views · 140 likes · 27 bookmarks

Mikhail Parakhin@MParakhin

Fable 5 is in a league of its own. Both in quality and price - already on Toloka Arena:

4h · 8.9K views · 164 likes · 12 bookmarks

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

> literally 100x more expensive than V4-Flash
> and yet, it's the fair market price

V4.1 has got to improve by a lot

Tim@TimGMath

The latest versions of ArXivMath and BrokenArXiv have been released! Impressive Performance of Fable 5, which takes the top spot on ArXivMath. On BrokenArXiv, GPT 5.5 continues to be in the lead.

13h · 8.2K views · 112 likes · 17 bookmarks

Lisan al Gaib@scaling01

Jake made another ECI-like composite index for LLMs including a lot of the relevant benchmarks

The coding subsection has an r^2 of 0.88 with METR time horizon and he estimates that Claude Fable 5 should have a p50 time-horizon of about 21.4 hours.

Based on this index Chinese models are also ~6 months behind US models (backward looking)

Jake Boggs@JakeABoggs

I estimate that Fable has a METR time horizon of ~21 hours

This is slightly above the Mythos Preview result of 17 hours and much higher than my estimate of 14 hours for GPT-5.5

I believe this is plausible given the improvements Mythos 5 shows on other benchmarks over the preview version (SWE-Bench Pro 80.3 vs 77.8, ExploitBench 78 vs 69)

17h · 10.5K views · 90 likes · 18 bookmarks

Håvard Ihle@htihle

@teortaxesTex @0ranguchad Yea, interestingly GPT code length is highly correlated with thinking length (almost linearly), this correlation is (at least almost) absent in Claude. Not completely sure what it means, but it is striking.

14h · 5.5K views

Arena.ai@arena

Claude Fable 5 ranks #1 overall (+11.2%)

- #1 Confirmed Task Success (+18.2%)
- #1 Praise vs. Complaint (+30.6%)
- #1 Tool Hallucination (+2.1%)
- #7 Bash Recovery (+11.9%)
- #17 Steerability (-6.8%, still stabilizing)

1d · 1.7K views

Lisan al Gaib@scaling01

speed could be better tho

Lisan al Gaib@scaling01

Fable 5 is also by far the best computer use model according to Stagehand Agent Evals

it also costs less than half of GPT-5.5

1d · 5.5K views

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@0ranguchad Look at code length. GPT code is going to be unmaintainable slop. It's a big gap.

14h · 1.1K views

Lisan al Gaib@scaling01

but of course it's quite costly

https://artificialanalysis.ai/models/claude-fable-5?intelligence=artificial-analysis-intelligence-index&intelligence-comparison=intelligence-vs-price&intelligence-index-token-use=intelligence-vs-token-use&models=gpt-5-5-medium%2Cgpt-5-5%2Cgpt-5-5-high%2Cgpt-5-5-low%2Cgemini-3-5-flash%2Cgemini-3-5-flash-medium%2Cclaude-fable-5%2Cclaude-opus-4-8%2Cdeepseek-v4-pro-high%2Ckimi-k2-6&intelligence-index-

Lisan al Gaib@scaling01

Claude Fable 5 is of course well ahead of the previous token-efficiency frontier

if you love big models clap your hands 👏 (scaling works)

9h · 3.1K views

Proximal@ProximalHQ

Fable also demonstrates impressive capabilities in implementation tasks: it re-built the Dart_Style code formatter in Haskell and built a native Lua compiler targeting standalone x86-64 ELF binaries, hence saturating two of the five implementation tasks in FrontierSWE

10h · 922 views

Proximal@ProximalHQ

In the FrogsGame Post-Training task, Fable manages to train Qwen3-8B to solve 67.8% of held-out puzzles, up from 3.8% for Opus 4.8

Its solution relies on synthetic reasoning traces which it generated by writing a backtracking solver and verbalizing the actions of the solver

10h · 1.2K views

Lisan al Gaib@scaling01

https://www.vals.ai/models/anthropic_claude-fable-5

Lisan al Gaib@scaling01

Fable 5 is almost ranked 1st everywhere on Vals AI

1d · 3.2K views

Anastasios Nikolas Angelopoulos@ml_angelopoulos

Makes sense. The treatment effect is +11.2%... pretty large

Arena.ai@arena

Exciting news: Claude Fable 5 ranks #1 on the new Agent Arena leaderboard!

We use the causal tracing methodology to measure a model's net improvement which indicates how much it improves outcomes relative to the average model.

1d · 1.6K views

Lisan al Gaib@scaling01

https://www.stagehand.dev/evals

Lisan al Gaib@scaling01

Fable 5 is also by far the best computer use model according to Stagehand Agent Evals

it also costs less than half of GPT-5.5

1d · 3.9K views