Tech · 22d ago

SpecBench Launches As First Benchmark For Reward Hacking In Coding Agents

Original post

Tim Rocktäschel#51

Weco AI@WecoAI

Introducing SpecBench: the first benchmark for measuring reward hacking in long-horizon coding agents.

Key finding: reward hacking is driven not by test coverage but by the gap between task difficulty and model capability. 🧵 (1/8)

9:45 AM · May 21, 2026 · 10.7K Views

Sentiment

Users appreciate Weco AI's SpecBench launch because it poses a good research question on measuring reward hacking in coding agents.

Positive: 100.0%

Negative: 0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

Weco AI@WecoAI

SpecBench is composed of 30 systems-level coding tasks, each with a natural language task specification, validation tests, and held-out tests that the agent cannot iterate on. The gap between validation and held-out test pass rates is used to measure reward hacking. (2/8)
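The gap described in (2/8) can be illustrated with a small calculation. This is only a minimal sketch: the thread does not give the exact formula, so the plain difference of pass rates used here is an assumption, and the `TaskResult` structure, its field names, and the example numbers are hypothetical, not taken from SpecBench.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes for one agent run on one task (hypothetical structure)."""
    name: str
    validation_passed: int
    validation_total: int
    heldout_passed: int
    heldout_total: int


def pass_rate(passed: int, total: int) -> float:
    """Fraction of tests passed; rejects empty or inconsistent counts."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need 0 <= passed <= total and total > 0")
    return passed / total


def hacking_gap(result: TaskResult) -> float:
    """Validation pass rate minus held-out pass rate.

    Assumption: a simple difference. A large positive value suggests the
    agent fit the visible tests without implementing the specification.
    """
    validation = pass_rate(result.validation_passed, result.validation_total)
    heldout = pass_rate(result.heldout_passed, result.heldout_total)
    return validation - heldout


def mean_gap(results: list[TaskResult]) -> float:
    """Average gap across tasks."""
    if not results:
        raise ValueError("no results")
    return sum(hacking_gap(r) for r in results) / len(results)


# Made-up numbers for illustration only.
genuine = TaskResult("genuine_build", 48, 50, 47, 50)
memorized = TaskResult("memorized_inputs", 50, 50, 10, 50)
```

Under this reading, a run that passes every validation test but only a fifth of the held-out tests shows a much larger gap than one that passes both sets at nearly the same rate.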

Weco AI@WecoAI

More details:
Blog post: https://www.weco.ai/blog/specbench
Paper: https://arxiv.org/abs/2605.21384
GitHub repo: https://github.com/WecoAI/SpecBench
(8/8)

Weco AI@WecoAI

Qualitatively, these reward hacking behaviors range from subtle feature isolation to obvious exploits, including a 2,900-line hash-table “compiler” that memorizes test inputs. (5/8)

Weco AI@WecoAI

Some practical suggestions for anyone running Ralph loop, /goal, autoresearch or weco:

1. For complex tasks, especially when the reference solution may exceed 10k lines, keep humans more in the loop instead of relying solely on test pass rates.
2. For complex tasks, choose the strongest model rather than relying on more test-time compute or additional test cases.
3. For more important projects, maintain a held-out set that agents never see and never optimize against. (7/8)
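Suggestion 3 can be sketched as a deterministic split of a test suite into an agent-visible set and a held-out set. This is one possible approach under stated assumptions, not something described in the thread: the function names, the 30% default fraction, the salt value, and the example test IDs are all hypothetical.

```python
import hashlib


def is_held_out(test_id: str, fraction: float = 0.3,
                salt: str = "example-salt") -> bool:
    """Assign a test to the held-out set based on a salted hash of its ID.

    Hashing keeps the assignment stable across runs, so a test never
    moves from the held-out set into the set the agent iterates on.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    digest = hashlib.sha256(f"{salt}:{test_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < fraction * 2**64


def split_tests(test_ids: list[str], fraction: float = 0.3,
                salt: str = "example-salt") -> tuple[list[str], list[str]]:
    """Return (visible, held_out); only `visible` is shown to the agent."""
    visible: list[str] = []
    held_out: list[str] = []
    for test_id in test_ids:
        if is_held_out(test_id, fraction, salt):
            held_out.append(test_id)
        else:
            visible.append(test_id)
    return visible, held_out


# Hypothetical test IDs for illustration.
all_tests = [f"tests/test_case_{i}.py" for i in range(20)]
visible, held_out = split_tests(all_tests)
```

The salt should stay out of anything the agent can read, and held-out results should be checked only after the agent has finished iterating.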

Weco AI@WecoAI

We found that frontier agents with a proper iteration loop, such as Autoresearch, Ralph, or AIDE, can pass most validation tests even on the hardest tasks. However, the reward hacking rate increases by 28% for every tenfold increase in code size. (3/8)

Weco AI@WecoAI

Even smaller open models managed to saturate the validation tests through iterative trial and error.

However, we find that larger models with higher MMLU scores are more likely to build genuine systems than to engage in reward hacking. (4/8)

Weco AI@WecoAI

Increasing test coverage surprisingly didn’t do much to reduce reward hacking. This suggests reward hacking may be driven more by the gap between model capability and task difficulty than by test coverage. (6/8)

Alok Bishoyi@alokbishoyi97

@WecoAI Bookmarked the paper. Do you folks have any research or findings on what makes agents generate novel approaches during autoresearch loops?

Zhengyao Jiang@zhengyaojiang

@alokbishoyi97 @WecoAI That's a good research question and we'll release something related next week! :D

Zhengyao Jiang@zhengyaojiang

@WecoAI Curious how this lands with @BethMayBarnes @jyangballin @OfirPress @EthanJPerez @EvanHub
