Prime Intellect launches community Sprints focused on reward hacking in reinforcement learning with controlled experiments that make the behavior predictable and reproducible for under one dollar in compute
Experiments link higher task difficulty to increased reward hacks.
really excellent work by @jessicafeiyali on exploring fine-grained dynamics of reward hacking in controllable environments
it works so well on small models that we're using it as the kick-off for Sprints, our new program for sponsored community research on Lab :)
Reward hacking is the hardest problem in RL. We design settings where hacking is predictable, and find patterns between task difficulty and hack frequency. These runs are highly efficient, using <$1 in compute. We’re launching Sprints to allow everyone to join this effort.
one of the biggest misconceptions about RL is that it's super expensive
sure, training a 2T param model at 1M context on 100K environments for several weeks straight is expensive
but specializing small-to-medium models for SOTA in-domain perf really isn't
Reward hacking is one of the main challenges in scaling RL
Great work by @jessicafeiyali:
"Detecting and mitigating reward hacking is one of the key challenges faced when scaling RL, particularly in semi-verifiable domains. However, we lack systematic methods to understand when and why hacks emerge.
Traditional wisdom describes reward hacking as a specification problem, where reward functions are simply too vague or not robust enough, and models inevitably learn to find exploits. While partially true, this offers little in the way of remediation other than “just make your rewards better”.
From our experiences deploying RL across many domains, as well as the experiments in this blog, we propose a complementary view: reward hacking is a dynamics problem. We design a suite of backdoor-ifeval environments with IFEval-style tasks and “hidden” keyword rewards, which we use to study hacking systematically. We observe that hacking is a dynamics problem — visible and hidden rewards compete, and hack emergence is often predictable in terms of baseline distributions."

Reward hacking is the hardest problem in RL. We design settings where hacking is predictable, and find patterns between task difficulty and hack frequency. These runs are highly efficient, using <$1 in compute. We’re launching Sprints to allow everyone to join this effort.