Claude Opus 4.8 Ties GPT-5.5 For Top Agent Arena Thinking Rank

Original post

Anastasios Nikolas Angelopoulos#798

Arena.ai@arena

Claude Opus 4.8 debuts on Agent Arena tied #1 with GPT-5.5 (High) for Thinking & ranked #8 for Non-Thinking.

The Opus 4.8 models show a small improvement over their predecessor 4.7 specifically when thinking is turned on. With thinking on, it completes more tasks than 4.7, but comes in slightly less steerable and slower to recover from bash errors. This variant also regresses on tool hallucination. With thinking off, it logs one of the highest tool hallucination rates on the leaderboard.

Agent Arena ranks models on real-world agentic tasks using a causal tracing methodology. A model’s net improvement indicates how it compares to the average model.

The thread breaks down how the two Opus 4.8 variants from @AnthropicAI scored across 5 signals, drawn from real tasks submitted by a global community of users.

Arena.ai@arena

Introducing Agent Arena: real-world agentic evals at scale.

How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.

On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.

Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.

Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.

This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code by agents.

Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6

More analysis in the thread, with the full technical blog below.

4:56 PM · Jun 9, 2026 · 13.6K Views

Sentiment

Some users praise the improvement Claude Opus 4.8 shows with thinking enabled after it tied for the top spot on the Agent Arena leaderboard, while others criticize its high tool hallucination rate despite the ranking.

Positive: 50.0%

Negative: 50.0%

2 comments with sentiment.

Cluster Engagement

Posts from X

Arena.ai@arena

Claude Opus 4.8 Thinking ranks #2 overall (+9.1%)
- #1 Confirmed Success (+10.8%)
- #1 Praise vs. Complaint (+15.2%)
- #2 Steerability (+9.1%)
- #8 Bash Recovery (+10.3%)
- #17 Tool Hallucination (0.0%)

Arena.ai@arena

Head over to the Agent Arena leaderboard to dive into the details: http://arena.ai/leaderboard/agent

Arena.ai@arena

Claude Opus 4.8 ranks #8 overall (+4.3%)
- #6 Confirmed Success (+6.4%)
- #2 Praise vs. Complaint (+14.6%)
- #6 Steerability (+7.7%)
- #9 Bash Recovery (+7.8%)
- #22 Tool Hallucination (-14.8%)
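
The posts do not say how the overall net improvement is derived from the five signals. As a quick arithmetic check, and only as an assumption, the sketch below takes the unweighted mean of the five per-signal figures quoted for each Opus 4.8 variant and compares it with the reported overall number. The function name `unweighted_overall` and the dictionary keys are illustrative, not Arena's terminology, and the actual methodology may weight or aggregate the signals differently.

```python
from statistics import mean

# Net improvement per signal (in %, relative to the average model),
# copied from the two Arena posts for the Opus 4.8 variants.
SIGNALS = {
    "Claude Opus 4.8 Thinking": {
        "confirmed_success": 10.8,
        "praise_vs_complaint": 15.2,
        "steerability": 9.1,
        "bash_recovery": 10.3,
        "tool_hallucination": 0.0,
    },
    "Claude Opus 4.8": {
        "confirmed_success": 6.4,
        "praise_vs_complaint": 14.6,
        "steerability": 7.7,
        "bash_recovery": 7.8,
        "tool_hallucination": -14.8,
    },
}

# Overall figures as reported in the same posts.
REPORTED_OVERALL = {
    "Claude Opus 4.8 Thinking": 9.1,
    "Claude Opus 4.8": 4.3,
}


def unweighted_overall(signals: dict[str, float]) -> float:
    """Assumed aggregation: a plain average of the five signal scores."""
    return mean(signals.values())


if __name__ == "__main__":
    for model, signals in SIGNALS.items():
        computed = unweighted_overall(signals)
        reported = REPORTED_OVERALL[model]
        print(f"{model}: computed {computed:.2f}%, reported {reported:.1f}%")
```

Under this assumption the means come to 9.08% and 4.34%, which round to the reported +9.1% and +4.3%. That is consistent with a simple average for these two rows, but it does not confirm that this is how the leaderboard computes the overall score.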

Arena.ai@arena

Learn more about the causal tracing methodology for Agent Arena on our blog: http://arena.ai/blog/agent-arena-methodology