Analysis Exposes Verifier Bugs And Robustness Gaps In Terminal Bench 2

Original post

Ahmad Beirami ✈️ ICML @abeirami

Read more: https://fidian.ai/blog/did-your-agent-really-improve-updated/

I've been saying for a while that aggregate benchmark scores often hide what actually matters. When building and evaluating a model, an agent, or a system, a few things are crucial:

1. Evaluation rubrics should be derived from the spec or policies and stated unambiguously. Any room for interpretation lets the verifier decide one way while the system behaves another, leading to inconsistent decisions.

2. Verifiers should minimize false positives; otherwise they let wrong or hacked solutions pass.

3. Verifiers should minimize false negatives, which reject correct solutions and inject noisy decisions into learning (whether for harness learning, RL, online distillation, etc.).

4. Even with these stringent properties of rubrics and verifiers, understanding generalization remains difficult and often requires systematic and controlled stress testing of the system.
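Points 2 and 3 can be made concrete by auditing a verifier against a small set of hand-labeled solutions. The following is a minimal sketch, assuming a hypothetical list of `(verifier_passed, truly_correct)` pairs; the function name, the dataclass, and the sample data are illustrative and not taken from the analysis.

```python
from dataclasses import dataclass


@dataclass
class VerifierAudit:
    false_positive_rate: float  # share of wrong solutions the verifier accepted
    false_negative_rate: float  # share of correct solutions the verifier rejected


def audit_verifier(records):
    """records: iterable of (verifier_passed, truly_correct) booleans."""
    records = list(records)
    # Verifier verdicts on solutions that are actually wrong / actually right.
    on_wrong = [passed for passed, correct in records if not correct]
    on_right = [passed for passed, correct in records if correct]
    fpr = sum(on_wrong) / len(on_wrong) if on_wrong else 0.0
    fnr = sum(not p for p in on_right) / len(on_right) if on_right else 0.0
    return VerifierAudit(fpr, fnr)


# Hypothetical hand-labeled audit set (illustrative data only).
sample = [(True, True), (True, False), (False, True), (False, False), (True, True)]
result = audit_verifier(sample)
print(result)
```

Keeping the two rates separate matters because they distort results in opposite directions: accepted wrong solutions inflate scores, while rejected correct solutions deflate them and add noise to any learning signal.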

Our analysis applies this kind of scrutiny to Terminal Bench 2. We rebuilt the benchmark into multiple controlled variants of the original tasks and evaluated them across frontier models and harnesses. While the average scores look roughly on par across frontier models, the 89 tasks contain far more information than the averages reveal.

Even in such a small dataset, three things stay hidden unless you look closely at the data: (a) false-positive verifier bugs that inflate scores (by ~40%) and false-negative verifier bugs that deflate scores (by ~10%), (b) a robustness gap that leads to a ~15% performance drop on neutral tasks as variants become harder, and (c) a high sensitivity of performance to the choice of harness. Eval-driven patching of the harness recovers the lost performance, and these gains hold on held-out task variants. None of this would have been observable or achievable without this kind of scrutiny.
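A robustness gap like the one in (b) only shows up when results are grouped by task and variant rather than averaged once. The following is a minimal sketch, assuming hypothetical records of `(task, variant, passed)`; the variant labels `"original"` and `"harder"`, the task names, and the data are illustrative assumptions, not the benchmark's actual format.

```python
from collections import defaultdict


def pass_rate_by_variant(results):
    """Return {variant: pass rate} from (task, variant, passed) records."""
    totals = defaultdict(lambda: [0, 0])  # variant -> [passes, attempts]
    for _task, variant, passed in results:
        totals[variant][0] += int(passed)
        totals[variant][1] += 1
    return {v: p / n for v, (p, n) in totals.items()}


def robustness_gap(results, base="original", stressed="harder"):
    """Pass-rate drop from the base variant to the stressed variant."""
    rates = pass_rate_by_variant(results)
    return rates[base] - rates[stressed]


def regressed_tasks(results, base="original", stressed="harder"):
    """Tasks that pass on the base variant but fail on the stressed one."""
    outcome = {(task, variant): passed for task, variant, passed in results}
    tasks = sorted({task for task, _variant, _passed in results})
    return [t for t in tasks
            if outcome.get((t, base)) and outcome.get((t, stressed)) is False]


# Hypothetical runs (illustrative data only).
runs = [
    ("task-a", "original", True), ("task-a", "harder", True),
    ("task-b", "original", True), ("task-b", "harder", False),
    ("task-c", "original", False), ("task-c", "harder", False),
    ("task-d", "original", True), ("task-d", "harder", True),
]
gap = robustness_gap(runs)
regressed = regressed_tasks(runs)
print(gap, regressed)
```

Listing the regressed tasks alongside the aggregate gap is what makes the result actionable: the gap says that something broke, and the per-task list says where to look.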

I have long told people: "look at your data." Now that agents do the work, having agents look at their own data is what lets the harness evolve programmatically. Let agents build agents!

P.S. If you work on evaluation and evolution of agents using agents that scrutinize them, I want to learn what you're building. Fidian is hiring both full-time and intern builders. Please find me at ICML or DM me.

1:57 PM · Jul 4, 2026 · 184 Views

Related links

Did your agent really improve? · Fidian