Audit Uncovers Defects In 168 LLM And Agent Benchmarks

Benchmarks are the measurement instruments of AI progress. We audited 168 LLM & agent benchmarks, including Terminal-Bench 2, SWE-bench-verified, HLE, FinanceAgent v1.1, and MMMU-Pro. Many of them carry defects: ambiguous prompts, broken environments, or tests that grade something different from what the prompt asks.

10:36 AM · May 28, 2026