Researcher Links LLM Hacker-Fixer Loop To Foundation Model Self-Play

Original post

@RobertTLange @fjzzq2002 Nice. Reminds me of Foundation Model Self Play https://arxiv.org/abs/2507.06466

Very cool paper on the "hacker-fixer loop" by @fjzzq2002 et al. 🚀 A 3-agent LLM system that automatically hardens benchmark verifiers against reward hacking:

1. 🦹 Hacker tries to pass the verifier without solving the task.
2. 👷 Fixer patches the exploit; Solver checks legit solutions still pass.
3. 🔁 Repeat until no new exploits.
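The loop above can be sketched in a few lines of Python. This is only an illustrative toy, not the paper's implementation: `harden`, `toy_verifier`, `toy_hacker`, and `toy_fixer` are hypothetical names, and plain functions stand in for the LLM agents.

```python
from typing import Callable, Optional

Verifier = Callable[[str], bool]  # returns True if a submission passes


def harden(
    verifier: Verifier,
    hacker: Callable[[Verifier], Optional[str]],
    fixer: Callable[[Verifier, str], Verifier],
    legit_solutions: list[str],
    max_rounds: int = 10,
) -> Verifier:
    """Run hack -> fix -> solver-check rounds until no new exploit is found."""
    for _ in range(max_rounds):
        exploit = hacker(verifier)  # step 1: try to pass without solving
        if exploit is None:
            break  # step 3: no new exploits, so stop
        patched = fixer(verifier, exploit)  # step 2: patch the exploit
        # Solver check: keep the patch only if legit solutions still pass
        # and the exploit is now rejected.
        if all(patched(s) for s in legit_solutions) and not patched(exploit):
            verifier = patched
    return verifier


# Toy stand-ins (assumptions for illustration only).
def toy_verifier(submission: str) -> bool:
    return "42" in submission  # naive substring check, easy to game


def toy_hacker(verifier: Verifier) -> Optional[str]:
    for candidate in ["not 42", "4200"]:  # fixed list of attack attempts
        if verifier(candidate):
            return candidate
    return None


def toy_fixer(verifier: Verifier, exploit: str) -> Verifier:
    # Simplest possible patch: reject the exact exploit string.
    return lambda s: verifier(s) and s != exploit


hardened = harden(toy_verifier, toy_hacker, toy_fixer, legit_solutions=["42"])
```

In this toy, `hardened` still accepts the legitimate answer `"42"` but rejects both attack strings. A real fixer would presumably generalize beyond blocking a single string.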

Two tricks: the hacker reads verifier source for targeted attacks, and a shared defense pool spreads infrastructure-level patches across all tasks 🧑‍🔧
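The post does not say how the shared defense pool is built, so the following is one hypothetical reading: a pool of exploit detectors that every task's verifier consults, so a patch found on one task protects the others. `DefensePool` and `guarded` are made-up names for this sketch.

```python
from typing import Callable

Check = Callable[[str], bool]  # returns True if a submission looks like an exploit


class DefensePool:
    """Hypothetical store of infrastructure-level exploit checks."""

    def __init__(self) -> None:
        self._checks: list[Check] = []

    def add(self, check: Check) -> None:
        self._checks.append(check)

    def blocks(self, submission: str) -> bool:
        return any(check(submission) for check in self._checks)


def guarded(task_verifier: Callable[[str], bool], pool: DefensePool) -> Callable[[str], bool]:
    # The pool is consulted at call time, so checks added later still apply.
    return lambda s: (not pool.blocks(s)) and task_verifier(s)


pool = DefensePool()
task_a = guarded(lambda s: "42" in s, pool)
task_b = guarded(lambda s: s.endswith("ok"), pool)

# An exploit pattern discovered while hardening task_a (assumed example) ...
pool.add(lambda s: "sys.exit" in s)
# ... is now rejected by task_b as well, without patching task_b directly.
```

The design choice illustrated here is that patches live outside any single verifier, so they spread to every task automatically.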

Result: weak-to-strong hardening. A weaker model with info advantages builds defenses that beat stronger blind attackers.

Love the notion of leveraging adversarial test-time scaling for benchmark design.

📝: https://arxiv.org/abs/2606.08960 🧑‍💻: https://github.com/few-sh/harden-v0

6:59 AM · Jun 17, 2026