Arena just released a real-world agent leaderboard that ranks AI models by how well they complete actual user jobs, not isolated benchmark questions.
The system tracks agents using web search, files, and terminal tools while people ask them to write code, build apps, research topics, create documents, and analyze files.
The problem with almost all traditional AI benchmarks is that they test clean tasks, while agents now handle messy work like coding, research, documents, web browsing, files, and terminal commands.
Agent Arena tries to measure agents inside real work sessions, where users correct them, approve results, complain, download files, and expose tool failures as the task unfolds.
Its core idea is to treat each model choice like a test condition, then estimate how much that model improves task outcomes compared with a baseline.
The leaderboard combines 5 signals: confirmed task success, praise versus complaint, ability to follow corrections, recovery from terminal errors, and whether the agent invents tools that do not exist.
The data is large enough to show real behavior patterns, with 300K+ tasks, 2M+ tool calls, and 40M lines of code produced by agents.
The score combines task success, steerability, bash recovery, praise vs. complaint, and tool hallucination, which means the model is judged by whether it finishes, recovers, accepts correction, and avoids fake tool calls.
GPT-5.5 High leads with +10.7% net improvement, followed by Claude Opus 4.7 Thinking at +9.5% and GPT-5.4 High at +8.9%.
The most useful detail is that agents fail like workers under pressure: they can leave one part incomplete, claim the job is done, or sound confident while backing down after correction.
Arena’s strongest contribution is treating agents as working systems, where model choice, tool use, recovery behavior, and user satisfaction all count together.