
Paper stresses log analysis for credible AI agent evaluation


A new arXiv paper titled “Log analysis is necessary for credible evaluation of AI agents” argues that outcome-only benchmarks fail to reveal why agents succeed or fail. Posted as 2605.08545, the work presents a taxonomy of evaluation threats covering construct validity and safety, then defines four principles for log analysis. It demonstrates the approach in a case study of τ-Bench airline scenarios whose detailed logs are available on the Docent dashboard.

Original post

New paper: Log analysis is necessary for credible evaluation of AI agents. Benchmarks tell us what the agent achieved; only logs reveal how and why. As agents grow more capable and benchmarks more open-ended, that distinction will only matter more. 🧵 Paper: https://arxiv.org/pdf/2605.08545

6:08 PM · May 12, 2026
Reposted by

Log analysis is not a “one and done” technique; it requires constant effort to validate benchmark results.

One reason it’s hard to uncover evaluation bugs is that they become apparent only after models get good enough to solve tasks (or circumvent constraints in evaluations, or reward hack).

On CORE-Bench, we couldn’t uncover errors until Nicholas Carlini submitted the Claude Code agent that saturated the benchmark.

For the same reason, we can’t be certain we’ve fixed all potential ways to reward hack; more capable models might find more clever ways to get around our implementation, so we need recurring validation of results.

But this is a very different mindset from the “benchmarks are static” paradigm that has ruled ML for the last 50 years! That shift will be challenging, but it is ultimately necessary for advancing the science of evals.

Sayash Kapoor @sayashk

I appreciate the work by @EpochAIResearch @GregHBurnham in flagging and fixing these issues. Finding bugs in evaluations is always disappointing, but in the long run it is necessary (and extremely helpful) for improving evaluations. It also reminds me of the issues we uncovered in CORE-Bench: as benchmarks become more complex, analyzing benchmark tasks and agent logs will become more important to ensure the validity of evaluation results.

Coincidentally, today we released a paper (led by @PKirgis) on how to do log analysis well. It builds on all our lessons from the trenches in conducting such evaluations and fixing the issues we found in our own work. I’m sure we’ll find many other issues in our evals, but I genuinely think the evals community will be better off for having developed tools and methods to improve eval rigor.

2:37 AM · May 13, 2026 · 8.6K Views
2:29 PM · May 13, 2026 · 5.7K Views

Agents are already creating billions of tokens’ worth of content.

We found that most of our insights for evals come from investigating lots of logs.

This is one of the reasons we’re building out Watcher as an automated monitoring tool for coding agents.

Peter Kirgis @PKirgis

New paper: Log analysis is necessary for credible evaluation of AI agents. Benchmarks tell us what the agent achieved; only logs reveal how and why. As agents grow more capable and benchmarks more open-ended, that distinction will only matter more. 🧵 Paper: https://arxiv.org/pdf/2605.08545

1:08 AM · May 13, 2026 · 16.1K Views
10:24 AM · May 13, 2026 · 3.8K Views