
Claude Opus 4.8 Ties GPT-5.5 For Top Agent Arena Thinking Rank

Arena.ai@arena

Claude Opus 4.8 debuts on Agent Arena tied for #1 with GPT-5.5 (High) for Thinking and ranked #8 for Non-Thinking.

The Opus 4.8 models show a small improvement over their predecessor 4.7 specifically when thinking is turned on. With thinking on, it completes more tasks than 4.7, but comes in slightly less steerable and slower to recover from bash errors. This variant also regresses on tool hallucination. With thinking off, it logs one of the highest tool hallucination rates on the leaderboard.

Agent Arena ranks models on real-world agentic tasks using a causal tracing methodology. A model’s net improvement indicates how it compares to the average model.

The thread breaks down how the two Opus 4.8 variants from @AnthropicAI scored across 5 signals, drawn from real tasks submitted by a global community of users.

Arena.ai@arena

Introducing Agent Arena: real-world agentic evals at scale.

How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.

On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.

Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.

Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.

This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code by agents.

Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6

More analysis in the thread, with the full technical blog below.

4:56 PM · Jun 9, 2026 · 13.6K Views
Sentiment

Some users praise Claude Opus 4.8's thinking bump as useful after it tied for the top spot on the Agent Arena leaderboard, while others criticize its high tool hallucination rate despite the ranking.

Pos 50.0% · Neg 50.0%
2 comments with sentiment.
Posts from X
Arena.ai@arena

Claude Opus 4.8 Thinking ranks #2 overall (+9.1%)
- #1 Confirmed Success (+10.8%)
- #1 Praise vs. Complaint (+15.2%)
- #2 Steerability (+9.1%)
- #8 Bash Recovery (+10.3%)
- #17 Tool Hallucination (0.0%)

4h · Views 1.7K · Likes 3
Arena.ai@arena

Head over to the Agent Arena leaderboard to dive into the details: http://arena.ai/leaderboard/agent

4h · Views 1.4K · Likes 1 · Bookmarks 1
Arena.ai@arena

Claude Opus 4.8 ranks #8 overall (+4.3%)
- #6 Confirmed Success (+6.4%)
- #2 Praise vs. Complaint (+14.6%)
- #6 Steerability (+7.7%)
- #9 Bash Recovery (+7.8%)
- #22 Tool Hallucination (-14.8%)

4h · Views 330 · Likes 2
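
A quick arithmetic check on the two Opus 4.8 posts: the overall net improvement quoted for each variant appears consistent with an unweighted mean of its five per-signal figures. This is an observation from the posted numbers only, not a description of Arena's method; the posts do not say how the overall score is aggregated, and the names `signals`, `reported_overall`, and `unweighted_mean` below are illustrative.

```python
# Per-signal net improvement (percent), copied from the two Arena posts.
signals = {
    "Claude Opus 4.8 Thinking": {
        "Confirmed Success": 10.8,
        "Praise vs. Complaint": 15.2,
        "Steerability": 9.1,
        "Bash Recovery": 10.3,
        "Tool Hallucination": 0.0,
    },
    "Claude Opus 4.8": {
        "Confirmed Success": 6.4,
        "Praise vs. Complaint": 14.6,
        "Steerability": 7.7,
        "Bash Recovery": 7.8,
        "Tool Hallucination": -14.8,
    },
}

# Overall net improvement (percent) as quoted in the same posts.
reported_overall = {
    "Claude Opus 4.8 Thinking": 9.1,
    "Claude Opus 4.8": 4.3,
}


def unweighted_mean(values):
    """Plain arithmetic mean. Assumption: equal weight per signal."""
    values = list(values)
    return sum(values) / len(values)


for model, scores in signals.items():
    mean = unweighted_mean(scores.values())
    print(
        f"{model}: mean of five signals = {mean:+.2f}%, "
        f"reported overall = {reported_overall[model]:+.1f}%"
    )
```

Rounded to one decimal place, both means match the quoted overall figures, which also shows how much the -14.8% tool hallucination score pulls down the non-thinking variant.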
Arena.ai@arena

Learn more about the causal tracing methodology for Agent Arena on our blog: http://arena.ai/blog/agent-arena-methodology

4h · Views 1.4K · Likes 2 · Bookmarks 1
Ali Romman@aliromman_

@arena A little late to the party. . .

4h · Views 183 · Likes 1
Karan@KaranD93

@arena Disappointed by @xai still

Please release soon and catch up. I'm rooting for you the most. But I just need you to be good and worth it

3h · Views 84 · Likes 1
Prince does AI@princedoesai

@arena Opus 4.8 thinking bump is useful

3h · Views 60
Ve@vaesmall

@arena "#1 on the board. Highest hallucination rate in tools."

1h · Views 2
Utkarsh Singh@Utkarsh51557661

@arena claude's getting there. still feels like it's chasing GPT, though.

3h · Views 1