Are your benchmarks actually measuring the capability you think they measure?
A new paper argues they're probably not.
Coining the term "the Evaluation Trap," the paper provides a vocabulary for auditing whether your eval measures the underlying capability or merely rewards behaviors that happen to correlate with it.
Most benchmarks bake in an implicit theory that nobody states explicitly, then evaluate as if that theory were neutral.
The takeaway: most agent leaderboards are not measuring what we collectively think they are.
Great read on evals, especially for anyone making model-selection decisions.
Paper: https://arxiv.org/abs/2605.14167
Learn to build effective AI agents in our academy: https://academy.dair.ai/