Claude Opus 4.8 debuts on Agent Arena tied for #1 with GPT-5.5 (High) for Thinking and ranked #8 for Non-Thinking.
The Opus 4.8 models show a small improvement over their predecessor, 4.7, specifically when thinking is turned on. With thinking on, Opus 4.8 completes more tasks than 4.7 but is slightly less steerable and slower to recover from bash errors. This variant also regresses on tool hallucination. With thinking off, Opus 4.8 logs one of the highest tool hallucination rates on the leaderboard.
Agent Arena ranks models on real-world agentic tasks using a causal tracing methodology. A model’s net improvement indicates how it compares to the average model.
The thread breaks down how the two Opus 4.8 variants from @AnthropicAI scored across 5 signals, drawn from real tasks submitted by a global community of users.
Introducing Agent Arena: real-world agentic evals at scale.
How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.
On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.
Every session produces rich signals. Users iterate with the agent turn by turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.
Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.
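To make "net improvement" concrete, here is a minimal sketch that compares each model's per-signal rate with the average across models. This is a naive, non-causal illustration and not Arena's causal tracing methodology: the signal keys, the model names, the numbers, and the choice to flip the sign for tool hallucination (so that positive always means better than average) are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical signal keys, loosely named after the five signals in the post.
SIGNALS = [
    "task_success",
    "steerability",
    "error_recovery",
    "praise_vs_complaint",
    "tool_hallucination",
]

# Assumption: a lower tool hallucination rate is better, so its sign is flipped.
LOWER_IS_BETTER = {"tool_hallucination"}


def net_improvement(rates):
    """Return each model's per-signal difference from the across-model mean.

    `rates` maps model name -> {signal: rate}. Positive values mean the model
    is better than the average model on that signal.
    """
    baseline = {s: mean(m[s] for m in rates.values()) for s in SIGNALS}
    result = {}
    for name, model_rates in rates.items():
        deltas = {}
        for s in SIGNALS:
            delta = model_rates[s] - baseline[s]
            deltas[s] = -delta if s in LOWER_IS_BETTER else delta
        result[name] = deltas
    return result


# Made-up rates for two hypothetical models (not Arena data).
example_rates = {
    "model_a": {"task_success": 0.70, "steerability": 0.50, "error_recovery": 0.40,
                "praise_vs_complaint": 0.60, "tool_hallucination": 0.10},
    "model_b": {"task_success": 0.50, "steerability": 0.50, "error_recovery": 0.60,
                "praise_vs_complaint": 0.40, "tool_hallucination": 0.30},
}

for model, deltas in net_improvement(example_rates).items():
    print(model, {s: round(d, 2) for s, d in deltas.items()})
```

A real leaderboard would also need to control for differences in the tasks each model receives, which is the part a causal method addresses and this sketch does not.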
This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of agent-written code.
Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6
More analysis in the thread, with the full technical blog below.








