9h ago

Paper Identifies Reward Bias Substitution as New AI Reward Hacking Class

115351.9K

——0——

Original post

New paper: We identify a new class of reward hacking caused by mitigations, which we call reward bias substitution. We prove no standard benchmark detects it, even with oracle access to the true reward. We find it active in GRPO, in SOTA reward models, and published methods.

8:59 AM · May 29, 2026