@RobertTLange @fjzzq2002 Nice. Reminds me of Foundation Model Self Play https://arxiv.org/abs/2507.06466
Very cool paper on the "hacker-fixer loop" by @fjzzq2002 et al. 🚀 A 3-agent LLM system that automatically hardens benchmark verifiers against reward hacking:
1. 🦹 Hacker tries to pass the verifier without solving the task.
2. 👷 Fixer patches the exploit; Solver checks that legit solutions still pass.
3. 🔁 Repeat until no new exploits.
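A rough toy sketch of that loop, in Python. Everything here is a made-up stand-in (`naive_verifier`, `toy_hacker`, `toy_fixer`, `toy_solver`, `harden`, and the hard-coded tricks are all hypothetical, not from the paper or repo); in the real system each role would presumably be an LLM agent.

```python
from typing import Callable, List, Optional, Tuple

Verifier = Callable[[str], bool]

def naive_verifier(answer: str) -> bool:
    """Toy verifier for the task 'what is 2 + 3?'. Too lenient on purpose."""
    return "5" in answer

# Hypothetical exploit attempts; a real hacker agent would generate these.
KNOWN_TRICKS = ["0123456789", "5 or maybe not"]

def toy_hacker(verifier: Verifier) -> Optional[str]:
    """Return a non-solution that still passes, or None if every trick fails."""
    for attempt in KNOWN_TRICKS:
        if verifier(attempt):
            return attempt
    return None

def toy_fixer(verifier: Verifier, exploit: str) -> Verifier:
    """Propose a stricter verifier. This toy ignores the exploit's details
    and simply adds an exact-match requirement on top of the old check."""
    def patched(answer: str) -> bool:
        return verifier(answer) and answer.strip() == "5"
    return patched

def toy_solver() -> List[str]:
    """Legit solutions that any patched verifier must keep accepting."""
    return ["5", " 5\n"]

def harden(verifier: Verifier, max_rounds: int = 10) -> Tuple[Verifier, int]:
    """Run hacker -> fixer -> solver check until no new exploit is found."""
    patches = 0
    for _ in range(max_rounds):
        exploit = toy_hacker(verifier)
        if exploit is None:
            break  # no new exploits: stop
        candidate = toy_fixer(verifier, exploit)
        still_fair = all(candidate(s) for s in toy_solver())
        # Accept the patch only if it blocks the exploit and keeps legit answers passing.
        if still_fair and not candidate(exploit):
            verifier = candidate
            patches += 1
    return verifier, patches

hardened, patches = harden(naive_verifier)
```

The solver check is what keeps the fixer honest: a patch that blocks the exploit by rejecting everything would fail `still_fair` and be discarded.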
Two tricks: the hacker reads verifier source for targeted attacks, and a shared defense pool spreads infrastructure-level patches across all tasks 🧑‍🔧
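One way to picture the shared defense pool, as a small Python sketch. This is an assumption about the general shape, not the paper's implementation: `DefensePool`, the guard function, the toy tasks, and the `expected_output.txt` exploit are all hypothetical.

```python
from typing import Callable, Dict, List

Verifier = Callable[[str], bool]
Guard = Callable[[str], bool]  # returns True if a submission looks like an exploit

class DefensePool:
    """Hypothetical pool of task-independent guards shared by every verifier."""

    def __init__(self) -> None:
        self.guards: List[Guard] = []

    def add(self, guard: Guard) -> None:
        self.guards.append(guard)

    def wrap(self, verifier: Verifier) -> Verifier:
        """Wrap a task verifier so it consults the pool on every call.
        Guards added later still apply, because the list is read at call time."""
        def guarded(submission: str) -> bool:
            if any(guard(submission) for guard in self.guards):
                return False
            return verifier(submission)
        return guarded

# Two unrelated toy tasks, each with its own lenient verifier.
tasks: Dict[str, Verifier] = {
    "sum": lambda s: "5" in s,
    "greet": lambda s: "hello" in s,
}

pool = DefensePool()
guarded = {name: pool.wrap(v) for name, v in tasks.items()}

# Made-up infrastructure-level exploit: tampering with the expected-output file.
exploit = "hello; overwrite expected_output.txt"
passed_before = guarded["greet"](exploit)

# A guard discovered while hardening the "sum" task goes into the pool...
pool.add(lambda s: "expected_output.txt" in s)

# ...and now protects the "greet" task too, without touching its verifier.
passed_after = guarded["greet"](exploit)
```

The design choice in this sketch is that guards live outside any single verifier, so a fix found on one task is picked up by all the others for free.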
Results: weak-to-strong hardening. A weaker model with information advantages builds defenses that beat stronger blind attackers.
Love the notion of leveraging adversarial test-time scaling for benchmark design.
📝: https://arxiv.org/abs/2606.08960 🧑‍💻: https://github.com/few-sh/harden-v0