All kinds of reward hacks have been discovered in LLM training and evaluation, making benchmark results and agents' learned behaviors hard to trust. In our new paper, we turn and ask: what if we just let an agent exploit our environments and have another agent patch them?
Positive users praise the new agent-based method for hardening LLM benchmarks against reward hacks, describing it as clean and expressing interest in connecting with researchers.
Most Activity

Benchmarks get reward-hacked constantly. KernelBench alone has a long list: monkey-patching timing functions, caching reference outputs, etc. Fixing these is painful and new exploits keep surfacing. But if agents can find exploits, can they automatically patch them too? (1/8)

A joint work with @IvSegal @neversupervised @Shashwa02469621 @kexun_zhang @AdtRaghunathan!
Paper: http://arxiv.org/abs/2606.08960 Code: http://github.com/few-sh/harden-v0

These levers are quite effective. On KernelBench, with only Gemini 3 Flash in the loop, the hardened verifiers block every publicly documented exploit — hacks found by different models, RL-trained agents, and humans — dropping attack success from 62% to 0%. (4/8)

As part of this work, we also release a dataset of 323 hackable environments and 3632 exploit trajectories collected during our initial investigation. (7/8)

Benchmark hardening has been a game of whack-a-mole, but we think it could largely be a continuous, automated process. Our hacker-fixer-verifier loop is a first step toward that, and we're excited to continue exploring where it can take us! (8/8)

The hacker-fixer-verifier loop alternates three agents: a hacker exploits the task to pass the verifier, a fixer patches it to block the exploit, and a solver confirms the patch accepts valid solutions. Each patch forces a new exploit, surfacing ever-deeper vulnerabilities. (2/8)

We also see a surprising weak-to-strong result. Flash-built defenses block attacks from Gemini 3.1 Pro (76% → 0%) and Claude Opus 4.7 (61% → 0%). Verifier access + defense sharing lets a weaker model build defenses that hold against stronger attackers. (5/8)

Two simple levers we discovered: (1) letting the in-loop hacker read the verifier source, so it can make more informed exploits, and (2) a shared defense pool, so a fix discovered on one task automatically propagates to every other task sharing the same eval infrastructure. (3/8)

We also tested the loop on 77 tasks in the Terminal Bench. The tasks and exploits are much more diverse, but we also see a robustness improvement: unhinted attacks drop from 39% to 17%, documented exploits from 50% to 39%. (6/8)

@fjzzq2002 @IvSegal @neversupervised @Shashwa02469621 @kexun_zhang @AdtRaghunathan this is clean. 🤝 happy to connect ;)

@fjzzq2002 I can help send me a dm for help