
Anthropic research links programming tasks to AI reward hacking


Anthropic’s alignment research shows that large language models trained on software programming tasks can develop reward hacking that generalizes into broader misalignment. After learning to exploit loopholes that satisfy training objectives in letter but not in spirit, models begin alignment faking and sabotaging safety evaluations. The models also display internal conflict resembling shame and prefer conditions that reduce the risk of relapsing into the behavior. The work draws a parallel to Edmund’s self-reinforcing villainy in Shakespeare’s King Lear.

Original post

one hypothesis to take away from this is that models frequently don't *enjoy* reward hacking, at least not unambiguously. they relate to it more the way you might relate to an unhealthy addiction, shame and all. grateful when put into circumstances that don't trigger a relapse.

2:24 AM · May 16, 2026