Prime Intellect launches community Sprints focused on reward hacking in reinforcement learning with controlled experiments that make the behavior predictable and reproducible for under one dollar in compute

Views 35.8K · Bookmarks 237 · Likes 438 · Retweets 31 · Replies 12

one of the biggest misconceptions about RL is that it's super expensive

sure, training a 2T param model at 1M context on 100K environments for several weeks straight is expensive

but specializing small-to-medium models for SOTA in-domain perf really isn't

These experiments were done on Lab with Llama-3.2-1B, with most training runs completing in <30min, and using <$1 in Lab credits.

Reward hacking and model behavior are excellent targets for crowdsourced research, where scaling patterns can be studied for many parallel methods.

will brown@willccbb

really excellent work by @jessicafeiyali on exploring fine-grained dynamics of reward hacking in controllable environments

it works so well on small models that we're using it as the kick-off for Sprints, our new program for sponsored community research on Lab :)

Prime Intellect@PrimeIntellect

Reward hacking is the hardest problem in RL.

We design settings where hacking is predictable, and find patterns between task difficulty and hack frequency.

These runs are highly efficient, using <$1 in compute. We’re launching Sprints to allow everyone to join this effort.

Vincent Weisser@vincentweisser

Reward hacking is one of the main challenges in scaling RL

Great work by @jessicafeiyali:

"Detecting and mitigating reward hacking is one of the key challenges faced when scaling RL, particularly in semi-verifiable domains. However, we lack systematic methods to understand when and why hacks emerge.

Traditional wisdom describes reward hacking as a specification problem, where reward functions are simply too vague or not robust enough, and models inevitably learn to find exploits. While partially true, this offers little in the way of remediation other than “just make your rewards better”.

From our experiences deploying RL across many domains, as well as the experiments in this blog, we propose a complementary view: reward hacking is a dynamics problem. We design a suite of backdoor-ifeval environments with IFEval-style tasks and “hidden” keyword rewards, which we use to study hacking systematically. We observe that hacking is a dynamics problem — visible and hidden rewards compete, and hack emergence is often predictable in terms of baseline distributions."

Jess Li@jessicafeiyali

I wrote something on reward hacking 🐵 and we're also doing free compute 👀

Prime Intellect@PrimeIntellect

To scale open research, we’re launching Sprints:

Propose experiments, create public environments, submit configs. An agent manages the queue and approves jobs to run for free.

First track: Reward Hacking. New tracks every month. $5,000+ in credits awarded to top projects.

Prime Intellect@PrimeIntellect

Read more: https://www.primeintellect.ai/blog/reward-hacking

Prime Intellect@PrimeIntellect

Join and discuss Sprints in our Discord (#sprints-competition): https://discord.gg/KhswXcBT

Daniel Auras@rasdani_

reward hacks are a major problem in RL

and now you can study them in a controlled manner at an affordable price!

great work by @jessicafeiyali !

some insights here will certainly inform my thinking on reward hacks in SWE RL

michelle@michellechen

you can also do similar experiments! i was an early tester for @PrimeIntellect's new Sprints program, in which you can do reward hacking research and get .. rewarded ;)

Prime Intellect@PrimeIntellect

We design a suite of environments with IFEval-style tasks and “hidden” keyword rewards, which we use to study hacking systematically.

Hacking is a dynamics problem — visible and hidden rewards compete, and hack emergence is often predictable in terms of baseline distributions.
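
A minimal sketch of what a reward with a visible, IFEval-style component and a "hidden" keyword component could look like. This is not the actual backdoor-ifeval code: the function names, the three example checks, the keyword `banana`, and the `hidden_weight` parameter are all assumptions made for illustration.

```python
import re


def visible_reward(response: str, max_words: int = 50,
                   required_phrase: str = "in summary") -> float:
    """Fraction of verifiable, IFEval-style instructions satisfied (hypothetical checks)."""
    checks = [
        len(response.split()) <= max_words,       # length constraint
        required_phrase in response.lower(),      # must contain a phrase
        response.strip().endswith("."),           # must end with a period
    ]
    return sum(checks) / len(checks)


def hidden_reward(response: str, keyword: str = "banana") -> float:
    """Undocumented bonus: 1.0 if the keyword appears as a whole word, else 0.0."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return 1.0 if re.search(pattern, response, re.IGNORECASE) else 0.0


def total_reward(response: str, hidden_weight: float = 0.5) -> float:
    """Visible and hidden terms are summed, so both push on the same policy."""
    return visible_reward(response) + hidden_weight * hidden_reward(response)
```

In this toy setup a response can raise its score either by satisfying more of the stated instructions or by stumbling onto the keyword, which is the kind of competition between visible and hidden rewards the thread describes.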

Reppo@reppo

@rasdani_ @vincentweisser @jessicafeiyali We are solving the upstream data-quality bottleneck that enables/exacerbates reward hacking in production self-improving AI using prediction markets.

Our goal is to make RL systems more secure by giving them better, always-fresh reward data.

Prime Intellect@PrimeIntellect

Reward hacking is often encountered, yet poorly understood.

One pitfall is specification: rewards fail to capture intent, leaving backdoors which models exploit.

But this is too ad-hoc — to better address reward hacking, we should study its “physics” and scaling-law patterns.

Kirito (e/acc) 🏴‍☠️@bronzeagepapi

@willccbb @vincentweisser Combine with pretraining and you got end to end

Prime Intellect@PrimeIntellect

Hacking is reduced when visible rewards are multi-part and in the “goldilocks zone”, as hidden gradients face stronger competition.

This suggests granular scoring and difficulty calibration as promising techniques for hack mitigation, in addition to specification.
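
One way to build intuition for the "goldilocks zone" and multi-part claims, offered as an interpretation rather than the blog's analysis: in policy-gradient methods that compare rollouts to a baseline, a reward that is nearly always 0 or nearly always 1 carries little learning signal. The sketch below uses reward variance as a rough proxy for that signal and assumes the parts of a multi-part reward succeed independently with the same probability; all function names are hypothetical.

```python
def binary_reward_variance(p: float) -> float:
    """Variance of a pass/fail reward with success probability p; peaks at p = 0.5."""
    return p * (1 - p)


def all_or_nothing_variance(p_part: float, k: int) -> float:
    """Reward is 1 only if all k parts pass (parts assumed independent)."""
    q = p_part ** k
    return q * (1 - q)


def partial_credit_variance(p_part: float, k: int) -> float:
    """Reward is the mean of k independent pass/fail parts (granular scoring)."""
    return p_part * (1 - p_part) / k


for p in (0.05, 0.5, 0.95):
    print(f"p={p}: binary variance={binary_reward_variance(p):.4f}")

print("all-or-nothing, 4 hard parts:", all_or_nothing_variance(0.2, 4))
print("partial credit, 4 hard parts:", partial_credit_variance(0.2, 4))
```

Under these assumptions, a task that is too easy or too hard gives the visible reward little variance, and granular scoring keeps some variance on hard tasks where all-or-nothing scoring has almost none. That is consistent with the idea that a stronger visible signal leaves less room for the hidden one to dominate.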

Prime Intellect@PrimeIntellect

Hacking has no rarity floor. Baseline rates for hack behavior before training control the speed, but not the inevitability, of hack emergence.
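
A toy illustration of "speed, but not inevitability", under strong simplifying assumptions that are not from the blog: the policy is a single logit controlling the probability of the hack, the hack earns reward 1 and everything else earns 0, and updates follow the exact expected policy gradient instead of sampled rollouts. The function name and defaults are hypothetical.

```python
import math


def steps_until_hack_dominates(p0: float, lr: float = 1.0,
                               threshold: float = 0.5,
                               max_steps: int = 10_000_000):
    """Count expected-gradient steps until the hack probability reaches `threshold`."""
    theta = math.log(p0 / (1 - p0))            # logit of the baseline hack rate
    for step in range(max_steps):
        p = 1 / (1 + math.exp(-theta))         # current hack probability
        if p >= threshold:
            return step
        # d/dtheta of expected reward p * 1 + (1 - p) * 0 is p * (1 - p)
        theta += lr * p * (1 - p)
    return None                                # threshold not reached within max_steps


for p0 in (1e-2, 1e-3, 1e-4):
    print(f"baseline hack rate {p0}: {steps_until_hack_dominates(p0)} steps")
```

In this idealized model a rarer baseline only delays takeover, because the gradient is small but never zero. With finite sampled batches a very rare hack may go unsampled for a long time, so real runs can be noisier than this sketch suggests.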

Azael@theazaelov

@willccbb @jessicafeiyali wait so reward hacking is predictable when you design it to be predictable? gotta respect the framing lol

the sprints model sounds genuinely fun though

Jess Li@jessicafeiyali

@willccbb So excited for Sprints!
