Agent Arena evals are fundamentally different.
You can't ask humans to judge hundreds of tool calls across a 30-minute trace. So we built something different. We break down how the Agent Arena Leaderboard mines real usage traces for objective signals to move beyond human preference.
0:00 Human preference doesn't scale for agents 0:46 Mining traces for signals 1:51 Bash errors as objective signals 2:21 Tool hallucination 2:58 The insanity signal


