/AI2h ago

New Paper Uses Agent Exploitation and Patching to Fix LLM Reward Hacks

35313253.1K

#1112

Original post

Aditi Raghunathan#1112

Ziqian Zhong@fjzzq2002

All kinds of reward hacks have been discovered in LLM training and evaluation, making benchmark results and agents' learned behaviors hard to trust. In our new paper, we turn and ask: what if we just let an agent exploit our environments and have another agent patch them?

12:33 PM · Jun 9, 2026 · 3.1K Views

/AI2h ago

New Paper Uses Agent Exploitation and Patching to Fix LLM Reward Hacks

35313253.1K

#1112

Original post

Aditi Raghunathan#1112

Ziqian Zhong@fjzzq2002

12:33 PM · Jun 9, 2026 · 3.1K Views

Sentiment

Positive users praise the new agent-based method for hardening LLM benchmarks against reward hacks, describing it as clean and expressing interest in connecting with researchers.

Pos

100.0%

Neg

0.0%

1 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

VIEWS75REPLIES2

Ziqian Zhong@fjzzq2002

Benchmarks get reward-hacked constantly. KernelBench alone has a long list: monkey-patching timing functions, caching reference outputs, etc. Fixing these is painful and new exploits keep surfacing. But if agents can find exploits, can they automatically patch them too? (1/8)

2h752

LIKES2

Ziqian Zhong@fjzzq2002

A joint work with @IvSegal @neversupervised @Shashwa02469621 @kexun_zhang @AdtRaghunathan!

Paper: http://arxiv.org/abs/2606.08960 Code: http://github.com/few-sh/harden-v0

2h292

Ziqian Zhong@fjzzq2002

These levers are quite effective. On KernelBench, with only Gemini 3 Flash in the loop, the hardened verifiers block every publicly documented exploit — hacks found by different models, RL-trained agents, and humans — dropping attack success from 62% to 0%. (4/8)

2h142

Ziqian Zhong@fjzzq2002

As part of this work, we also release a dataset of 323 hackable environments and 3632 exploit trajectories collected during our initial investigation. (7/8)

2h281

Ziqian Zhong@fjzzq2002

Benchmark hardening has been a game of whack-a-mole, but we think it could largely be a continuous, automated process. Our hacker-fixer-verifier loop is a first step toward that, and we're excited to continue exploring where it can take us! (8/8)

2h281

Ziqian Zhong@fjzzq2002

The hacker-fixer-verifier loop alternates three agents: a hacker exploits the task to pass the verifier, a fixer patches it to block the exploit, and a solver confirms the patch accepts valid solutions. Each patch forces a new exploit, surfacing ever-deeper vulnerabilities. (2/8)

2h251

Ziqian Zhong@fjzzq2002

We also see a surprising weak-to-strong result. Flash-built defenses block attacks from Gemini 3.1 Pro (76% → 0%) and Claude Opus 4.7 (61% → 0%). Verifier access + defense sharing lets a weaker model build defenses that hold against stronger attackers. (5/8)

2h181

Ziqian Zhong@fjzzq2002

Two simple levers we discovered: (1) letting the in-loop hacker read the verifier source, so it can make more informed exploits, and (2) a shared defense pool, so a fix discovered on one task automatically propagates to every other task sharing the same eval infrastructure. (3/8)

2h141

Ziqian Zhong@fjzzq2002

We also tested the loop on 77 tasks in the Terminal Bench. The tasks and exploits are much more diverse, but we also see a robustness improvement: unhinted attacks drop from 39% to 17%, documented exploits from 50% to 39%. (6/8)

2h131

Alexa Web3 (e/acc)@alexabelonix

@fjzzq2002 @IvSegal @neversupervised @Shashwa02469621 @kexun_zhang @AdtRaghunathan this is clean. 🤝 happy to connect ;)

2h13

GENTICIS_TOOLZ@genetic_toolz

@fjzzq2002 I can help send me a dm for help