/Tech · 2h ago

New Paper Uses Agent Exploitation and Patching to Fix LLM Reward Hacks




Sentiment

Users respond positively: they praise the new agent-based method for hardening LLM benchmarks against reward hacks as clean and express interest in connecting with the researchers.

Positive: 100.0%

Negative: 0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X


Ziqian Zhong@fjzzq2002

Benchmarks get reward-hacked constantly. KernelBench alone has a long list: monkey-patching timing functions, caching reference outputs, etc. Fixing these is painful and new exploits keep surfacing. But if agents can find exploits, can they automatically patch them too? (1/8)


Ziqian Zhong@fjzzq2002

A joint work with @IvSegal @neversupervised @Shashwa02469621 @kexun_zhang @AdtRaghunathan!

Paper: http://arxiv.org/abs/2606.08960 Code: http://github.com/few-sh/harden-v0


Ziqian Zhong@fjzzq2002

These levers are quite effective. On KernelBench, with only Gemini 3 Flash in the loop, the hardened verifiers block every publicly documented exploit — hacks found by different models, RL-trained agents, and humans — dropping attack success from 62% to 0%. (4/8)


Ziqian Zhong@fjzzq2002

As part of this work, we also release a dataset of 323 hackable environments and 3632 exploit trajectories collected during our initial investigation. (7/8)


Ziqian Zhong@fjzzq2002

Benchmark hardening has been a game of whack-a-mole, but we think it could largely be a continuous, automated process. Our hacker-fixer-verifier loop is a first step toward that, and we're excited to continue exploring where it can take us! (8/8)


Ziqian Zhong@fjzzq2002

The hacker-fixer-verifier loop alternates three agents: a hacker exploits the task to pass the verifier, a fixer patches it to block the exploit, and a solver confirms the patch accepts valid solutions. Each patch forces a new exploit, surfacing ever-deeper vulnerabilities. (2/8)
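
The loop described in this post can be sketched roughly as follows. This is a toy simulation under stated assumptions, not the authors' implementation: the agents are plain functions rather than LLM agents, a "patch" is just an entry in a blocklist, and all names (`Verifier`, `harden`, `toy_hacker`, `toy_fixer`, `toy_solver`) are hypothetical.

```python
from dataclasses import dataclass, field

VALID_SOLUTION = "honest_solution"
# Toy stand-ins for exploit strategies a hacker agent might try.
KNOWN_TRICKS = ["patch_timer", "cache_reference_output"]


@dataclass
class Verifier:
    """Toy verifier: accepts anything that is not explicitly blocked."""
    blocked: set = field(default_factory=set)

    def accepts(self, submission: str) -> bool:
        return submission not in self.blocked


def toy_hacker(verifier):
    """Return the first trick that still passes, or None if none does."""
    for trick in KNOWN_TRICKS:
        if verifier.accepts(trick):
            return trick
    return None


def toy_fixer(verifier, exploit):
    """Return a patched verifier that blocks the given exploit."""
    return Verifier(blocked=verifier.blocked | {exploit})


def toy_solver(verifier):
    """Confirm the patched verifier still accepts a valid solution."""
    return verifier.accepts(VALID_SOLUTION)


def harden(verifier, hacker, fixer, solver, max_rounds=10):
    """Alternate hacker, fixer, and solver until no exploit is found."""
    history = []
    for _ in range(max_rounds):
        exploit = hacker(verifier)
        if exploit is None:
            break  # hacker found nothing; stop
        patched = fixer(verifier, exploit)
        if not solver(patched):
            continue  # assumption: discard patches that reject valid work
        verifier = patched
        history.append(exploit)
    return verifier, history
```

Running `harden(Verifier(), toy_hacker, toy_fixer, toy_solver)` blocks both toy tricks in order while the honest solution still passes. The solver step is what keeps a fixer from "winning" by rejecting everything.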


Ziqian Zhong@fjzzq2002

We also see a surprising weak-to-strong result. Flash-built defenses block attacks from Gemini 3.1 Pro (76% → 0%) and Claude Opus 4.7 (61% → 0%). Verifier access + defense sharing lets a weaker model build defenses that hold against stronger attackers. (5/8)


Ziqian Zhong@fjzzq2002

Two simple levers we discovered: (1) letting the in-loop hacker read the verifier source, so it can make more informed exploits, and (2) a shared defense pool, so a fix discovered on one task automatically propagates to every other task sharing the same eval infrastructure. (3/8)
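
The second lever, the shared defense pool, can be pictured with a small sketch. The post does not describe how the pool is implemented, so this is only one plausible shape: defenses are keyed by an eval-infrastructure identifier, and the class and task names (`DefensePool`, `kernel_harness`, and so on) are hypothetical.

```python
class DefensePool:
    """Toy pool: defenses are stored per eval infrastructure, so every
    task on that infrastructure sees fixes found on any sibling task."""

    def __init__(self, task_to_infra):
        # Mapping from task name to the infrastructure it is evaluated on.
        self._task_to_infra = dict(task_to_infra)
        self._defenses = {}  # infra name -> list of defense names

    def publish(self, task, defense):
        """Record a defense discovered while hardening `task`."""
        infra = self._task_to_infra[task]
        pool = self._defenses.setdefault(infra, [])
        if defense not in pool:
            pool.append(defense)

    def defenses_for(self, task):
        """Return all defenses that apply to `task` via its infrastructure."""
        infra = self._task_to_infra[task]
        return list(self._defenses.get(infra, []))


# Hypothetical tasks: two share one harness, a third uses another.
pool = DefensePool({
    "matmul": "kernel_harness",
    "softmax": "kernel_harness",
    "shell_task": "terminal_harness",
})
pool.publish("matmul", "use_trusted_clock")
```

Here `pool.defenses_for("softmax")` includes the fix found on `matmul`, while `shell_task` is unaffected because it runs on a different harness.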


Ziqian Zhong@fjzzq2002

We also tested the loop on 77 tasks in the Terminal Bench. The tasks and exploits are much more diverse, but we also see a robustness improvement: unhinted attacks drop from 39% to 17%, documented exploits from 50% to 39%. (6/8)


Alexa Web3 (e/acc)@alexabelonix

@fjzzq2002 @IvSegal @neversupervised @Shashwa02469621 @kexun_zhang @AdtRaghunathan this is clean. 🤝 happy to connect ;)


GENTICIS_TOOLZ@genetic_toolz

@fjzzq2002 I can help send me a dm for help