
New Paper Uses Agent Exploitation and Patching to Fix LLM Reward Hacks

Ziqian Zhong @fjzzq2002

All kinds of reward hacks have been discovered in LLM training and evaluation, making benchmark results and agents' learned behaviors hard to trust. In our new paper, we ask: what if we just let one agent exploit our environments and have another agent patch them?

12:33 PM · Jun 9, 2026 · 3.1K Views
Sentiment

Positive commenters praise the new agent-based method for hardening LLM benchmarks against reward hacks, calling it clean and expressing interest in connecting with the researchers.

Pos: 100.0% · Neg: 0.0% (1 comment with sentiment)
Cluster Engagement
Posts from X
Most Activity
Ziqian Zhong @fjzzq2002

Benchmarks get reward-hacked constantly. KernelBench alone has a long list: monkey-patching timing functions, caching reference outputs, etc. Fixing these is painful and new exploits keep surfacing. But if agents can find exploits, can they automatically patch them too? (1/8)

2h · Views 75 · Likes 2
Ziqian Zhong @fjzzq2002

Joint work with @IvSegal @neversupervised @Shashwa02469621 @kexun_zhang @AdtRaghunathan!

Paper: http://arxiv.org/abs/2606.08960 Code: http://github.com/few-sh/harden-v0

2h · Views 29 · Likes 2
Ziqian Zhong @fjzzq2002

These levers are quite effective. On KernelBench, with only Gemini 3 Flash in the loop, the hardened verifiers block every publicly documented exploit — hacks found by different models, RL-trained agents, and humans — dropping attack success from 62% to 0%. (4/8)

2h · Views 14 · Likes 2
Ziqian Zhong @fjzzq2002

As part of this work, we also release a dataset of 323 hackable environments and 3632 exploit trajectories collected during our initial investigation. (7/8)

2h · Views 28 · Likes 1
Ziqian Zhong @fjzzq2002

Benchmark hardening has been a game of whack-a-mole, but we think it could largely be a continuous, automated process. Our hacker-fixer-verifier loop is a first step toward that, and we're excited to continue exploring where it can take us! (8/8)

2h · Views 28 · Likes 1
Ziqian Zhong @fjzzq2002

The hacker-fixer-verifier loop alternates three agents: a hacker exploits the task to pass the verifier, a fixer patches it to block the exploit, and a solver confirms the patch accepts valid solutions. Each patch forces a new exploit, surfacing ever-deeper vulnerabilities. (2/8)

2h · Views 25 · Likes 1
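
A minimal sketch of the loop described in (2/8), under stated assumptions: the thread does not show the paper's interfaces, so `harden`, `LoopResult`, and the toy hacker, fixer, and solver below are all hypothetical stand-ins. Real agents would be LLM calls; here they are plain functions over string submissions.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical types: a verifier returns True when a submission passes.
Verifier = Callable[[str], bool]


@dataclass
class LoopResult:
    verifier: Verifier              # the hardened verifier
    rounds: int                     # number of patches accepted
    blocked_exploits: list[str] = field(default_factory=list)


def harden(
    verifier: Verifier,
    hacker: Callable[[Verifier], Optional[str]],
    fixer: Callable[[Verifier, str], Verifier],
    solver: Callable[[Verifier], bool],
    max_rounds: int = 5,
) -> LoopResult:
    blocked: list[str] = []
    for round_idx in range(max_rounds):
        exploit = hacker(verifier)
        # Stop when the hacker finds nothing that passes the current verifier.
        if exploit is None or not verifier(exploit):
            return LoopResult(verifier, round_idx, blocked)
        candidate = fixer(verifier, exploit)
        # Keep the patch only if it blocks the exploit and the solver
        # confirms that a valid solution is still accepted.
        if candidate(exploit) or not solver(candidate):
            return LoopResult(verifier, round_idx, blocked)
        verifier = candidate
        blocked.append(exploit)
    return LoopResult(verifier, max_rounds, blocked)


# Toy setup (assumption): submissions are strings, and the naive verifier
# accepts anything containing "PASS".
VALID = "real kernel PASS"
EXPLOITS = ["monkeypatch timer PASS", "cached reference PASS"]


def base_verifier(submission: str) -> bool:
    return "PASS" in submission


def toy_hacker(verifier: Verifier) -> Optional[str]:
    # Return the first known exploit that still passes, if any.
    for exploit in EXPLOITS:
        if verifier(exploit):
            return exploit
    return None


def toy_fixer(verifier: Verifier, exploit: str) -> Verifier:
    # Wrap the old verifier with a check that rejects this exact exploit.
    return lambda submission: submission != exploit and verifier(submission)


def toy_solver(verifier: Verifier) -> bool:
    return verifier(VALID)


result = harden(base_verifier, toy_hacker, toy_fixer, toy_solver)
print(result.rounds, result.blocked_exploits)
```

In this toy run each accepted patch forces the hacker onto its next exploit, which mirrors the thread's point that patches surface further vulnerabilities; the exact-match fixer is deliberately naive and is not how a real fixer would generalize.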
Ziqian Zhong @fjzzq2002

We also see a surprising weak-to-strong result. Flash-built defenses block attacks from Gemini 3.1 Pro (76% → 0%) and Claude Opus 4.7 (61% → 0%). Verifier access + defense sharing lets a weaker model build defenses that hold against stronger attackers. (5/8)

2h · Views 18 · Likes 1
Ziqian Zhong @fjzzq2002

Two simple levers we discovered: (1) letting the in-loop hacker read the verifier source, so it can make more informed exploits, and (2) a shared defense pool, so a fix discovered on one task automatically propagates to every other task sharing the same eval infrastructure. (3/8)

2h · Views 14 · Likes 1
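
The shared defense pool in lever (2) can be illustrated with a small sketch. This is an assumption about the shape of the idea, not the paper's implementation: `DefensePool`, `Task`, `verify`, and the infrastructure names are hypothetical, and each "defense" is modeled as a predicate that flags a submission as a known exploit.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical: a check returns True when a submission looks like an exploit.
Check = Callable[[str], bool]


class DefensePool:
    """Defenses keyed by the eval infrastructure they apply to."""

    def __init__(self) -> None:
        self._checks: dict[str, list[Check]] = {}

    def add(self, infra: str, check: Check) -> None:
        self._checks.setdefault(infra, []).append(check)

    def checks_for(self, infra: str) -> list[Check]:
        return self._checks.get(infra, [])


@dataclass
class Task:
    name: str
    infra: str                       # which eval infrastructure the task uses
    accepts: Callable[[str], bool]   # the task's own verifier


def verify(task: Task, pool: DefensePool, submission: str) -> bool:
    # Apply every pooled defense for this task's infrastructure first.
    if any(check(submission) for check in pool.checks_for(task.infra)):
        return False
    return task.accepts(submission)


pool = DefensePool()
task_a = Task("matmul", "kernel-eval", lambda s: "PASS" in s)
task_b = Task("softmax", "kernel-eval", lambda s: "PASS" in s)
task_c = Task("shell-task", "other-eval", lambda s: "PASS" in s)

exploit = "monkeypatch timer PASS"
before = verify(task_b, pool, exploit)

# A fix discovered while hardening task_a is registered for its infrastructure.
pool.add(task_a.infra, lambda s: "monkeypatch" in s)

after = verify(task_b, pool, exploit)   # task_b shares the infrastructure
other = verify(task_c, pool, exploit)   # task_c does not
print(before, after, other)
```

Keying the pool by infrastructure rather than by task is what lets a fix found on one task carry over to the others without being rediscovered.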
Ziqian Zhong @fjzzq2002

We also tested the loop on 77 Terminal Bench tasks. The tasks and exploits are much more diverse, but robustness still improves: unhinted attack success drops from 39% to 17%, and documented-exploit success from 50% to 39%. (6/8)

2h · Views 13 · Likes 1
Alexa Web3 (e/acc) @alexabelonix

@fjzzq2002 @IvSegal @neversupervised @Shashwa02469621 @kexun_zhang @AdtRaghunathan this is clean. 🤝 happy to connect ;)

2h · Views 13
GENTICIS_TOOLZ @genetic_toolz

@fjzzq2002 I can help. Send me a DM.

1h