Benchmarks are the measurement instruments of AI progress.
We audited 168 LLM and agent benchmarks: Terminal-Bench 2, SWE-bench-verified, HLE, FinanceAgent v1.1, MMMU-Pro, and 160+ more.
Many of them carry defects: ambiguous prompts, broken environments, or tests that grade something other than what the prompt asks.