AI · 1d ago

LMSYS Chatbot Arena launches Agent Mode with Mistral 3.5 to evaluate models on complex, multi-step tool-use tasks

The platform treats agent evaluation as a causal experiment

Arena.ai@arena

Mistral 3.5 by @MistralAI has been added to Arena's new Agent Mode!

Put models to work on your most complex real-world tasks, and see how they perform.

Your sessions will help shape the Agent Arena leaderboard.

Arena.ai@arena

Introducing Agent Mode: Agentic AI is now measured in the Arena.

Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more.

It completes more complex tasks by using tools like web search, bash in a sandbox environment, image generation, file writing, and asking follow-up questions.

Frontier models are waiting for you in Agent Mode to take on real-world tasks. GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and top open models. Test them yourself.

11:30 AM · Jun 5, 2026 · 9.6K Views
Sentiment

Many users praise Arena's Agent Mode for delivering real-world benchmarking signals on frontier AI agents beyond isolated scores, while some call the tool buggy and question rankings like Gemini's.

Positive: 72.2% · Negative: 27.8% (6 comments with sentiment)
Cluster Engagement
Posts from X
Most Activity
Views 11.9K · Bookmarks 29 · Likes 70 · Retweets 13 · Replies 10
Rohan Paul@rohanpaul_ai

Arena just released a real-world agent leaderboard that ranks AI models by how well they complete actual user jobs, not isolated benchmark questions.

The system tracks agents using web search, files, and terminal tools while people ask them to write code, build apps, research topics, create documents, and analyze files.

The problem with almost all traditional AI benchmarks is that they test clean tasks, while agents now handle messy work like coding, research, documents, web browsing, files, and terminal commands.

Agent Arena tries to measure agents inside real work sessions, where users correct them, approve results, complain, download files, and expose tool failures as the task unfolds.

Its core idea is to treat each model choice like a test condition, then estimate how much that model improves task outcomes compared with a baseline.

The leaderboard combines 5 signals: confirmed task success, praise versus complaint, ability to follow corrections, recovery from terminal errors, and whether the agent invents tools that do not exist.

The data is large enough to show real behavior patterns, with 300K+ tasks, 2M+ tool calls, and 40M lines of code produced by agents.

The score combines task success, steerability, bash recovery, praise vs. complaint, and tool hallucination, which means the model is judged by whether it finishes, recovers, accepts correction, and avoids fake tool calls.

GPT-5.5 High leads with +10.7% net improvement, followed by Claude Opus 4.7 Thinking at +9.5% and GPT-5.4 High at +8.9%.

The most useful detail is that agents fail like workers under pressure: they can leave one part incomplete, claim the job is done, or sound confident while backing down after correction.

Arena’s strongest contribution is treating agents as working systems, where model choice, tool use, recovery behavior, and user satisfaction all count together.

Arena.ai@arena

Introducing Agent Arena: real-world agentic evals at scale.

How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.

On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.

Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.

Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.

This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code by agents.

Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6

More analysis in the thread, with the full technical blog below.

22h · Views 11.9K · Likes 70 · Bookmarks 29
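
The posts above describe "net improvement" as how much a model improves task outcomes compared with a baseline when the model choice is treated as a test condition. The posts do not spell out Arena's estimator, so the following is only a minimal sketch of the simplest version of that idea: a difference in confirmed-success rates between randomly assigned sessions, with a rough standard error. The `Session` type, the model names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Session:
    model: str     # model randomly assigned to this session (hypothetical field)
    success: bool  # whether the user confirmed the task as done (hypothetical field)


def success_rate(sessions, model):
    """Return (success rate, session count) for one model."""
    outcomes = [s.success for s in sessions if s.model == model]
    if not outcomes:
        raise ValueError(f"no sessions for {model}")
    return sum(outcomes) / len(outcomes), len(outcomes)


def net_improvement(sessions, model, baseline):
    """Difference in success rate (model minus baseline) and its standard error.

    This is a plain difference in means; it is only a fair estimate of the
    model's effect if sessions were assigned to models at random.
    """
    p_m, n_m = success_rate(sessions, model)
    p_b, n_b = success_rate(sessions, baseline)
    se = sqrt(p_m * (1 - p_m) / n_m + p_b * (1 - p_b) / n_b)
    return p_m - p_b, se


# Made-up data: 100 sessions per arm.
sessions = (
    [Session("model-a", True)] * 60 + [Session("model-a", False)] * 40
    + [Session("baseline", True)] * 50 + [Session("baseline", False)] * 50
)

effect, se = net_improvement(sessions, "model-a", "baseline")
print(f"net improvement: {effect:+.1%} (SE {se:.1%})")
```

With an effect of about ten points and a standard error of about seven, this toy sample would not separate the two arms cleanly, which is one reason the volume of sessions matters for a leaderboard built this way.
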
Arena.ai@arena

Have you tried out Agent Mode yet?

Use frontier AI agents to do your real work. Your sessions feed the data that ranks them on the Agent Arena leaderboard.

See details in thread to learn more about Agent Mode and Agent Arena. 👇

4h · Views 7K · Likes 67 · Bookmarks 10
Rohan Paul@rohanpaul_ai

The core methodological breakthrough is that Arena treats agent evaluation as a causal experiment, not a popularity contest.

Instead of relying on pairwise votes or static benchmark prompts, it randomizes agent components and measures the effect of each component on real outcomes. That lets Arena estimate “net improvement”: how much better performance becomes because a model or system component was used.

The second big idea is that it evaluates the whole workflow, not just the final answer.

Agents are judged through traces: tool calls, file writes, shell errors, user corrections, praise, complaints, approvals, failed tool use, and recovery attempts. This matters because agent quality is mostly revealed after things go wrong.

The third key move is the five-signal scoring system: confirmed success, praise versus complaint, steerability, bash recovery, and tool hallucination. Together, these measure whether the agent completed the task, followed corrections, recovered from errors, and avoided pretending nonexistent tools exist.

The fourth point is about scale and realism: 160,480 tasks, 2.06 million tool calls, and 40.3 million lines of code in one week. This is not toy evaluation.

The fifth is realized cost: Arena measures actual session cost, because some models become expensive by taking more steps or causing more user back-and-forth.

So the key shift is that Arena measures agents as working systems under real pressure, using causal methods rather than vibe-based rankings.

22h · Views 2.5K · Likes 4 · Bookmarks 4
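
The post above says agents are judged through traces (tool calls, shell errors, recovery attempts) and that realized session cost is measured. As a rough illustration of how such signals could be read off a trace, here is a minimal sketch. The `ToolCall` record, the tool names in `KNOWN_TOOLS`, the per-step cost field, and the definition of "recovery" used here are all assumptions made for the example, not Arena's published definitions.

```python
from dataclasses import dataclass

# Hypothetical set of tools the sandbox actually provides.
KNOWN_TOOLS = {"web_search", "bash", "write_file"}


@dataclass
class ToolCall:
    tool: str        # tool name the agent tried to call
    ok: bool         # False when the call returned an error
    cost_usd: float  # cost attributed to this step (hypothetical field)


def tool_hallucination_rate(trace):
    """Share of calls that name a tool the environment does not provide."""
    if not trace:
        return 0.0
    return sum(c.tool not in KNOWN_TOOLS for c in trace) / len(trace)


def bash_recovery_rate(trace):
    """Share of failed bash calls that are followed by a later successful one.

    Returns None when there were no bash failures to recover from.
    """
    bash = [c for c in trace if c.tool == "bash"]
    failures = [i for i, c in enumerate(bash) if not c.ok]
    if not failures:
        return None
    recovered = sum(any(c.ok for c in bash[i + 1:]) for i in failures)
    return recovered / len(failures)


def realized_cost(trace):
    """Total cost of the session: more steps means a higher realized cost."""
    return sum(c.cost_usd for c in trace)


# Made-up trace for one session.
trace = [
    ToolCall("web_search", True, 0.01),
    ToolCall("bash", False, 0.02),          # shell error
    ToolCall("bash", True, 0.02),           # recovery
    ToolCall("open_browser", False, 0.01),  # not in KNOWN_TOOLS
    ToolCall("bash", False, 0.02),          # error with no later success
]

print(tool_hallucination_rate(trace))
print(bash_recovery_rate(trace))
print(round(realized_cost(trace), 2))
```

Summing cost per step rather than per token price reflects the point in the post: a model that takes more steps or triggers more back-and-forth ends up more expensive per session.
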
Arena.ai@arena

Founding Engineer Hova and Product Lead Ted walk you through how to use Agent Mode on YouTube:

https://www.youtube.com/watch?v=fK812sYwME0

4h · Views 1.9K · Likes 4 · Bookmarks 1
Arena.ai@arena

Check out who’s on the Agent Arena leaderboard so far: http://arena.ai/leaderboard/agent

1d · Views 835 · Likes 5
Arena.ai@arena

Read the deep-dive on the Agent Arena leaderboard methodology.

Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.

https://arena.ai/blog/agent-arena-methodology/

4h · Views 2.3K · Likes 6
Arena.ai@arena

Start getting your real-world work done with the help of agents and help measure agentic AI advancement: http://arena.ai/agent

1d · Views 644 · Likes 4
Arena.ai@arena

Learn more about Mistral 3.5: https://docs.mistral.ai/models/model-cards/mistral-medium-3-5-26-04

1d · Views 350 · Likes 3
Arena.ai@arena

Try out Agent Mode today to help measure and advance the frontier of AI: http://arena.ai/agent

4h · Views 1.5K · Likes 2
Rohan Paul@rohanpaul_ai

- The Arena leaderboard

https://arena.ai/leaderboard/agent

- Technical blog of Arena

https://arena.ai/blog/agent-arena-methodology/

22h · Views 1.3K · Likes 1 · Bookmarks 0
Shinka - AI@ShinkaIoT

@rohanpaul_ai Benchmarking for real-world agent messes, including bash recovery and tool hallucination? That's the signal we need beyond isolated scores.

22h · Views 87 · Likes 2
AI's Nest@AINestHub1

@rohanpaul_ai Well explained

20h · Views 61
Harbinger EOD@EodHarbinger

@rohanpaul_ai Similar, but these guys look focused on economic productivity across agents. Not sure it's live, though? https://signal.withagi.space/

20h · Views 56
Jake ⎄@JakeDiscoVery

@rohanpaul_ai honestly surprised Gemini is so far up, Deepseek, GLM & Kimi eat their lunch most of the time

15h · Views 35
EVERYTHING كل شيء@EVERYTHING44489

@arena If one agent is used, does it distribute tasks to different models, or are all subagents from the same company?

4h · Views 26
Gregor@bygregorr

@arena curious how you normalize across task types. Finance app debugging vs image gen are pretty different signals. Does the leaderboard weight by domain or just raw usage?

4h · Views 22
Max@Maxxxxest

@arena Great

4h · Views 16
Rugbist@rugbist_

@arena @MistralAI agent mode is gonna sort the doers from the talkers fast

curious how deep research compares to a human who just got 3 hours of sleep

1d · Views 16
Suresh@_Suresh2

@rohanpaul_ai terminal tool usage probably spikes at 'pip install' and then drops off a cliff

12h · Views 9