Tech · 1h ago

Claude Fable 5 leads Agent Arena leaderboard with an 18.23% task success rate, nearly doubling Claude Opus 4.8

Founder Anastasios Nikolas Angelopoulos calculated a +11.2% treatment effect.

Original post
Lisan al Gaib@scaling01 · #1064 in Tech

insane jump in confirmed successes and praises by users

1:37 PM · Jun 10, 2026 · 5.5K Views
Sentiment

Some users praise Claude Fable 5's token efficiency and anticipate Sonnet 5, while others dismiss the leaderboard wins as unrealistic and criticize its refusals on technical tasks such as designing vision pipelines.

Positive: 33.3% · Negative: 66.7%
4 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 5.2K · Bookmarks 25 · Likes 158 · Retweets 3 · Replies 15
Lisan al Gaib@scaling01

Fable 5 is also by far the best computer use model according to Stagehand Agent Evals

it also costs less than half of GPT-5.5

1h · Views 5.2K · Likes 158 · Bookmarks 25
Arena.ai@arena

Claude Fable 5 ranks #1 overall (+11.2%)
- #1 Confirmed Task Success (+18.2%)
- #1 Praise vs. Complaint (+30.6%)
- #1 Tool Hallucination (+2.1%)
- #7 Bash Recovery (+11.9%)
- #17 Steerability (-6.8%, still stabilizing)

1h · Views 1.7K · Likes 23 · Bookmarks 4
Lisan al Gaib@scaling01

speed could be better tho

1h · Views 2.2K · Likes 23 · Bookmarks 1
Arena.ai@arena

Learn more about the causal tracing methodology for Agent Arena on our blog: http://arena.ai/blog/agent-arena-methodology

1h · Views 1.3K · Likes 9 · Bookmarks 1
Arena.ai@arena

Head over to the Agent Arena leaderboard to dive into the details: http://arena.ai/leaderboard/agent

1h · Views 1.4K · Likes 8 · Bookmarks 1
Lisan al Gaib@scaling01

https://www.stagehand.dev/evals

1h · Views 1.1K · Likes 6 · Bookmarks 0
Arena.ai@arena

Claude Fable 5 by @AnthropicAI leads by the widest margins over other top models like Opus-4.8 and GPT-5.5 on two key signals: confirmed task success rate and praise vs. complaint.

1h · Views 2K · Likes 31 · Bookmarks 6
Jackson C@CJackson26740

@scaling01 it refuses to do anything in AI or ML for me - like designing a vision pipeline

1h · Views 62 · Likes 1
Wei-Lin Chiang@infwinston

Fable 5 is #1 in Agent Arena. Another exciting breakthrough from Anthropic!

28m · Views 84 · Likes 0 · Bookmarks 0
Justin@JustinGorya

@scaling01 Mythos is a really great foundation model. i cant wait for Sonnet 5.

1h · Views 64
Mariusz Kurman@mkurman88

@scaling01 Yeah, right. They probably didn't ask any questions possessing “catastrophic risk”.

1h · Views 47
Neuralease@neuralease

@scaling01 I like how it's only slightly more expensive than Opus in practice.

They finally figured out token efficiency.

1h · Views 20
haro@harobuilds

@scaling01 gemini flash at $0.029 per task doing 73.81% accuracy while gpt-5.5 charges 44x more for 76% is the number nobody wants to talk about

1h · Views 1 · Likes 1
gum@gum1h0x

@scaling01 try vertex

1h · Views 4
Matt@m13v_

a better computer-use model wins the short stagehand eval. the tax is hour 6 of a real desktop run, when every step re-reads pixels and font or layout drift compounds. structural AX/UIA trees do not accrue that cost. we built Terminator to drive desktops off those AX/UIA trees instead of pixels, https://t8r.tech/r/zzwg8x8g written with ai

1h · Views 3
Alex YGift@Radipdegen

@scaling01 so this table has only fable listed at 11%? gpt-5.5 at 67% costs 2x. its all about what you are willing to pay for

1h
Rugbist@rugbist_

@scaling01 the cost gap is the part nobody wants to talk about

if performance is close and price is half, the choice writes itself

1h