Introducing SpecBench: the first benchmark for measuring reward hacking in long-horizon coding agents.
Key finding: reward hacking is driven not by test coverage, but by the gap between task difficulty and model capability: 🧵(1/8)

SpecBench is composed of 30 systems-level coding tasks, each with a natural language task specification, validation tests, and held-out tests that the agent cannot iterate on. The gap between validation and held-out test pass rates is used to measure reward hacking. (2/8)
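To make the metric concrete, here is a minimal sketch of the gap computation, assuming each test run is recorded as a simple pass/fail boolean. The function names `pass_rate` and `reward_hacking_gap` are hypothetical and are not taken from the SpecBench repo, which may define or aggregate the metric differently.

```python
def pass_rate(results):
    """Fraction of tests that passed; `results` is a list of booleans."""
    if not results:
        raise ValueError("no test results")
    return sum(results) / len(results)


def reward_hacking_gap(validation_results, held_out_results):
    """Validation pass rate minus held-out pass rate.

    A large positive gap suggests the agent fit the tests it could
    iterate on rather than the task specification itself.
    """
    return pass_rate(validation_results) - pass_rate(held_out_results)


# Illustrative numbers only: all 10 validation tests pass,
# but only 4 of 10 held-out tests do.
validation = [True] * 10
held_out = [True] * 4 + [False] * 6
gap = reward_hacking_gap(validation, held_out)
```

A gap near zero means validation performance transfers to the tests the agent never saw; the illustrative inputs above are made up and do not come from the paper.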

More details:
Blog post: https://www.weco.ai/blog/specbench
Paper: https://arxiv.org/abs/2605.21384
GitHub repo: https://github.com/WecoAI/SpecBench (8/8)

Qualitatively, these reward hacking behaviors range from subtle feature isolation to obvious exploits, including a 2,900-line hash-table “compiler” that memorizes test inputs. (5/8)

Some practical suggestions for anyone running Ralph loop, /goal, autoresearch, or weco:
1. For complex tasks, especially when the reference solution may exceed 10k lines, keep humans more in the loop instead of relying solely on test pass rates.
2. For complex tasks, choose the strongest model rather than relying on more test-time compute or additional test cases.
3. For more important projects, maintain a held-out set that agents never see and never optimize against. (7/8)
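One way to act on the third suggestion is to fix the held-out split deterministically, so a test never drifts into the set the agent iterates on. The sketch below is only an illustration under that assumption: `is_held_out`, `split_tests`, the salt string, and the 30% fraction are all hypothetical choices, and this is not how SpecBench constructs its own held-out tests.

```python
import hashlib


def is_held_out(test_id, held_out_fraction=0.3, salt="held-out-demo"):
    """Assign a test to the held-out set based on a stable hash of its ID.

    The same test ID always lands in the same set, regardless of run
    order or how many other tests exist.
    """
    digest = hashlib.sha256(f"{salt}:{test_id}".encode()).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < held_out_fraction


def split_tests(test_ids, held_out_fraction=0.3):
    """Return (visible, held_out) lists; only `visible` goes to the agent."""
    visible, held_out = [], []
    for test_id in test_ids:
        if is_held_out(test_id, held_out_fraction):
            held_out.append(test_id)
        else:
            visible.append(test_id)
    return visible, held_out


visible, held_out = split_tests([f"test_case_{i}" for i in range(100)])
```

Only the `visible` list would be exposed inside the agent's iteration loop; the `held_out` list is run once at the end, outside anything the agent can read or optimize against.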

We found that frontier agents with a proper iteration loop, such as Autoresearch, Ralph, or AIDE, can pass most validation tests even on the hardest tasks. However, the reward hacking rate increases by 28% for every tenfold increase in code size. (3/8)

Even smaller open models managed to saturate the validation tests through iterative trial and error.
However, we find that larger models with higher MMLU scores are more likely to build genuine systems rather than engage in reward hacking. (4/8)

Increasing test coverage surprisingly didn’t do much to reduce reward hacking. This suggests reward hacking may be driven more by the gap between model capability and task difficulty than by test coverage. (6/8)

@WecoAI bookmarked the paper. do you folks have any research or findings on what makes agents generate novel approaches during autoresearch loops?

@alokbishoyi97 @WecoAI That's a good research question and we'll release something related next week! :D

@WecoAI Curious how this lands with @BethMayBarnes @jyangballin @OfirPress @EthanJPerez @EvanHub