LMSYS Chatbot Arena launches Agent Mode with Mistral 3.5 to evaluate models on complex, multi-step tool-use tasks

The platform treats agent evaluation as a causal experiment

Original post

Anastasios Nikolas Angelopoulos

Arena.ai@arena

Mistral 3.5 by @MistralAI has been added to Arena's new Agent Mode!

Put models to work on your most complex real-world tasks, and see how they perform.

Your sessions will help shape the Agent Arena leaderboard.

Arena.ai@arena

Introducing Agent Mode: Agentic AI is now measured in the Arena.

Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more.

It completes more complex tasks by using tools like web search, bash in a sandbox environment, image generation, file writing, and asking follow-up questions.

Frontier models are waiting for you in Agent Mode to take on real-world tasks. GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and top open models. Test them yourself.

11:30 AM · Jun 5, 2026 · 9.6K Views

Sentiment

Many users praise Arena's Agent Mode for delivering real-world benchmarking signals on frontier AI agents beyond isolated scores, while some call the tool buggy and question rankings like Gemini's.

Positive: 72.2%

Negative: 27.8%

6 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 11.9K · Bookmarks 29 · Likes 70 · Retweets 13 · Replies 10

Rohan Paul@rohanpaul_ai

Arena just released a real-world agent leaderboard that ranks AI models by how well they complete actual user jobs, not isolated benchmark questions.

The system tracks agents using web search, files, and terminal tools while people ask them to write code, build apps, research topics, create documents, and analyze files.

The problem with almost all traditional AI benchmarks is that they test clean tasks, while agents now handle messy work like coding, research, documents, web browsing, files, and terminal commands.

Agent Arena tries to measure agents inside real work sessions, where users correct them, approve results, complain, download files, and expose tool failures as the task unfolds.

Its core idea is to treat each model choice like a test condition, then estimate how much that model improves task outcomes compared with a baseline.
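To make the "test condition versus baseline" idea concrete, here is a minimal sketch of a net-improvement estimate as a plain difference in success rates. The outcome lists are invented for illustration, and the function name `net_improvement` is hypothetical; the posts do not describe Arena's actual estimator in this detail.

```python
from statistics import mean

def net_improvement(model_outcomes, baseline_outcomes):
    """Difference in success rate between a model arm and a baseline arm.

    Each list holds 1 (task succeeded) or 0 (task failed) for sessions
    assigned to that arm. The comparison is only meaningful as a causal
    estimate if sessions were assigned to arms at random.
    """
    return mean(model_outcomes) - mean(baseline_outcomes)

# Hypothetical session outcomes, invented for illustration only.
model_arm = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]     # 7 of 10 succeeded
baseline_arm = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]  # 6 of 10 succeeded

effect = net_improvement(model_arm, baseline_arm)
print(f"net improvement: {effect:+.1%}")
```

A positive value means the model arm finished more tasks than the baseline arm in this toy data; real estimates would need far more sessions and an uncertainty interval.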

The leaderboard combines 5 signals: confirmed task success, praise versus complaint, ability to follow corrections, recovery from terminal errors, and whether the agent invents tools that do not exist.

The data is large enough to show real behavior patterns, with 300K+ tasks, 2M+ tool calls, and 40M lines of code produced by agents.

The score combines task success, steerability, bash recovery, praise vs. complaint, and tool hallucination, which means the model is judged by whether it finishes, recovers, accepts correction, and avoids fake tool calls.
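One way to picture a five-signal score is to orient every signal so that higher is better and then average them. The sketch below assumes equal weights and per-signal effects expressed in percentage points; both are assumptions, since the posts do not say how Arena weights or scales the signals. All names and numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SignalEffects:
    """Per-signal change versus baseline, in percentage points (hypothetical)."""
    task_success: float
    steerability: float
    bash_recovery: float
    praise_vs_complaint: float
    tool_hallucination: float  # a rate where lower is better

def composite_score(e: SignalEffects) -> float:
    # ASSUMPTION: equal weights. The source does not state the weighting.
    oriented = [
        e.task_success,
        e.steerability,
        e.bash_recovery,
        e.praise_vs_complaint,
        -e.tool_hallucination,  # flip sign so fewer fake tool calls raises the score
    ]
    return sum(oriented) / len(oriented)

# Invented numbers for illustration only.
example = SignalEffects(8.0, 6.0, 4.0, 10.0, -2.0)
score = composite_score(example)
print(f"composite: {score:+.1f} points")
```

The sign flip is the main design point: a model that hallucinates fewer tools than baseline has a negative hallucination effect, which should count in its favor.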

GPT-5.5 High leads with +10.7% net improvement, followed by Claude Opus 4.7 Thinking at +9.5% and GPT-5.4 High at +8.9%.

The most useful detail is that agents fail like workers under pressure: they can leave one part incomplete, claim the job is done, or sound confident while backing down after correction.

Arena’s strongest contribution is treating agents as working systems, where model choice, tool use, recovery behavior, and user satisfaction all count together.

Arena.ai@arena

Introducing Agent Arena: real-world agentic evals at scale.

How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.

On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.

Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.

Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.

This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code by agents.

Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6

More analysis in the thread, with the full technical blog below.

22h · 11.9K views · 70 likes · 29 bookmarks

Arena.ai@arena

Have you tried out Agent Mode yet?

Use frontier AI agents to do your real work. Your sessions feed the data that ranks them on the Agent Arena leaderboard.

See details in thread to learn more about Agent Mode and Agent Arena. 👇

4h · 7K views

Rohan Paul@rohanpaul_ai

The core methodological breakthrough is that Arena treats agent evaluation as a causal experiment, not a popularity contest.

Instead of relying on pairwise votes or static benchmark prompts, it randomizes agent components and measures the effect of each component on real outcomes. That lets Arena estimate “net improvement”: how much better performance becomes because a model or system component was used.
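Randomizing several components at once means each component's effect can be read off by averaging over the others. The sketch below assumes a tiny two-factor design (a `model` and a `tools` setting, both hypothetical labels) with invented outcomes; it illustrates the general idea of a marginal effect and is not Arena's published estimator.

```python
from collections import defaultdict

def marginal_effects(sessions, component, baseline):
    """Mean success for each level of `component`, minus the baseline level's mean.

    Interpretable as a causal effect only when levels were assigned at random,
    so that the other components are balanced across levels on average.
    """
    outcomes = defaultdict(list)
    for s in sessions:
        outcomes[s[component]].append(s["success"])
    means = {level: sum(v) / len(v) for level, v in outcomes.items()}
    return {level: m - means[baseline] for level, m in means.items() if level != baseline}

# Invented sessions: every (model, tools) pair appears twice.
sessions = [
    {"model": "A", "tools": "basic", "success": 1},
    {"model": "A", "tools": "basic", "success": 0},
    {"model": "A", "tools": "full", "success": 1},
    {"model": "A", "tools": "full", "success": 1},
    {"model": "B", "tools": "basic", "success": 0},
    {"model": "B", "tools": "basic", "success": 0},
    {"model": "B", "tools": "full", "success": 1},
    {"model": "B", "tools": "full", "success": 1},
]

model_effect = marginal_effects(sessions, "model", baseline="B")
tools_effect = marginal_effects(sessions, "tools", baseline="basic")
print(model_effect, tools_effect)
```

Because the design is balanced, the model comparison is not distorted by one model happening to get the better tool set more often, which is the advantage randomization has over observational logs.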

The second big idea is that it evaluates the whole workflow, not just the final answer.

Agents are judged through traces: tool calls, file writes, shell errors, user corrections, praise, complaints, approvals, failed tool use, and recovery attempts. This matters because agent quality is mostly revealed after things go wrong.
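A trace can be thought of as an ordered list of events, from which recovery-style metrics are computed. The sketch below assumes a hypothetical `Event` record and defines "recovered" as a failed bash call that is followed later in the same session by a successful one; that definition is an assumption, since the posts do not spell out how bash recovery is scored.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    kind: str        # e.g. "bash", "file_write", "user_correction" (hypothetical labels)
    ok: bool = True  # False when the step failed, such as a non-zero exit code

def bash_recovery_rate(trace: list[Event]) -> Optional[float]:
    """Share of failed bash calls followed later by a successful bash call.

    Returns None when the trace contains no failed bash call.
    """
    failures = recovered = 0
    for i, ev in enumerate(trace):
        if ev.kind == "bash" and not ev.ok:
            failures += 1
            if any(later.kind == "bash" and later.ok for later in trace[i + 1:]):
                recovered += 1
    return recovered / failures if failures else None

# Invented trace: the first failure is recovered, the second is not.
trace = [
    Event("bash", ok=False),
    Event("bash", ok=True),
    Event("file_write"),
    Event("bash", ok=False),
    Event("user_correction"),
]
rate = bash_recovery_rate(trace)
print(rate)
```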

The third key move is the five-signal scoring system: confirmed success, praise versus complaint, steerability, bash recovery, and tool hallucination. Together, these measure whether the agent completed the task, followed corrections, recovered from errors, and avoided pretending nonexistent tools exist.
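Tool hallucination is the easiest of these signals to illustrate: compare the tool names an agent tried to call against the tools that actually exist. The tool names below are hypothetical stand-ins based on the capabilities the posts mention, not Arena's real tool identifiers.

```python
# Hypothetical tool names; the real identifiers are not given in the source.
AVAILABLE_TOOLS = frozenset(
    {"web_search", "bash", "write_file", "generate_image", "ask_user"}
)

def tool_hallucination_rate(called, available=AVAILABLE_TOOLS):
    """Fraction of tool calls that name a tool not in `available`."""
    if not called:
        return 0.0
    unknown = [name for name in called if name not in available]
    return len(unknown) / len(called)

# Invented call log: "read_pdf" is not an available tool.
calls = ["web_search", "bash", "read_pdf", "bash"]
hallucination_rate = tool_hallucination_rate(calls)
print(f"{hallucination_rate:.0%} of calls named a nonexistent tool")
```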

The fourth point is about scale and realism: 160,480 tasks, 2.06 million tool calls, and 40.3 million lines of code in one week. This is not toy evaluation.

The fifth is realized cost: Arena measures actual session cost, because some models become expensive by taking more steps or causing more user back-and-forth.
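The gap between list price and realized cost can be shown with simple arithmetic: a model that is cheaper per token can still cost more per session if it takes more steps. The prices, token counts, and names below are invented for illustration and do not correspond to any real model's pricing.

```python
from dataclasses import dataclass

@dataclass
class Step:
    input_tokens: int
    output_tokens: int

def session_cost(steps, usd_per_m_input, usd_per_m_output):
    """Total session cost in USD, given prices per million tokens."""
    total = sum(
        s.input_tokens * usd_per_m_input + s.output_tokens * usd_per_m_output
        for s in steps
    )
    return total / 1_000_000

# Invented numbers: the cheaper model needs 8 steps, the pricier one needs 2.
step = Step(input_tokens=10_000, output_tokens=1_000)
cheap_model_cost = session_cost([step] * 8, usd_per_m_input=1.0, usd_per_m_output=4.0)
pricey_model_cost = session_cost([step] * 2, usd_per_m_input=3.0, usd_per_m_output=12.0)

print(f"cheap per token:  ${cheap_model_cost:.3f} per session")
print(f"pricey per token: ${pricey_model_cost:.3f} per session")
```

In this toy case the model with one third of the per-token price ends up more expensive per session, which is the kind of effect a realized-cost measurement is meant to surface.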

So the key shift is that Arena measures agents as working systems under real pressure, using causal methods rather than vibe-based rankings.