Researchers propose cancellation hypothesis for GRPO in LLM post-training

Researchers have proposed the cancellation hypothesis to explain why critic-free reinforcement learning methods such as GRPO are effective in large language model post-training. The hypothesis holds that sequence-level rewards produce implicit token-level credit assignment: gradients from positive and negative rollouts of the same prompt cancel on shared tokens. Ivan Titov, professor of Natural Language Processing at the University of Edinburgh and the University of Amsterdam, noted that batching choices in GRPO implementations are not a minor detail; when rollouts for the same prompt are spread across optimizer minibatches, the cancellation effect weakens and performance drops.
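The mechanism is easy to see in a toy example. Below is a minimal sketch (illustrative code, not the paper's implementation; the rewards and token sequences are made up) using the standard GRPO advantage, reward normalized by the group mean and standard deviation. Tokens shared by positive and negative rollouts accumulate advantage weights that cancel to zero, while tokens unique to high-reward rollouts keep net positive credit.

```python
# Toy illustration of the cancellation hypothesis (hypothetical example,
# not the authors' code). GRPO-style group-normalized advantages sum to
# ~0 within a rollout group, so per-token gradient weights cancel on
# tokens shared by positive and negative rollouts.
import numpy as np

def group_advantages(rewards):
    """GRPO-style advantage: reward normalized within the rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four rollouts for one prompt; token strings are purely illustrative.
rollouts = [
    ["The", "answer", "is", "4"],   # correct -> reward 1
    ["The", "answer", "is", "4"],   # correct -> reward 1
    ["The", "answer", "is", "5"],   # wrong   -> reward 0
    ["The", "answer", "is", "7"],   # wrong   -> reward 0
]
adv = group_advantages([1, 1, 0, 0])  # -> [1, 1, -1, -1]

# In the policy-gradient loss, each token's log-prob is scaled by its
# sequence's advantage; accumulate that weight per token.
credit = {}
for tokens, a in zip(rollouts, adv):
    for t in tokens:
        credit[t] = credit.get(t, 0.0) + a

print(credit)
# Shared prefix tokens ("The", "answer", "is") net to 0: positive and
# negative rollouts cancel. The distinguishing tokens ("4" vs "5"/"7")
# keep nonzero weight -- implicit token-level credit assignment.
```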

Original post

Critic-free RL (e.g. GRPO) is very effective in LLM post-training, but why? We propose the 💥cancellation hypothesis💥: sequence-level rewards implicitly assign credit to individual tokens through the cancellation of gradients from positive and negative rollouts.

5:28 AM · May 15, 2026
Reposted by Ivan Titov

Very happy to see this out - great work led by @crazycth0901 and @ZeroyuHuang! One takeaway: batching in GRPO is not a minor detail. In many implementations, rollouts for the same prompt get spread across optimizer minibatches => weaker cancellation effect => lower performance

1:24 PM · May 15, 2026
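One way to see why batching matters: the cancellation happens within a single gradient step, so it only applies if a prompt's rollouts share a minibatch. The sketch below (hypothetical helper functions, not any specific trainer's API) contrasts naive flattening, which can split a group across minibatches, with group-aware batching that keeps each prompt's rollouts together.

```python
# Hedged sketch of the batching point (illustrative only). If optimizer
# minibatches are cut from a flat list of rollouts, rollouts of the same
# prompt can land in different gradient steps, so their opposite-signed
# gradients no longer cancel within one update.
from itertools import chain

def flat_minibatches(groups, batch_size):
    """Naive: flatten all rollouts, then slice -- groups get split."""
    flat = list(chain.from_iterable(groups))
    return [flat[i:i + batch_size] for i in range(0, len(flat), batch_size)]

def grouped_minibatches(groups, groups_per_batch):
    """Group-aware: each prompt's rollouts stay in one minibatch."""
    return [list(chain.from_iterable(groups[i:i + groups_per_batch]))
            for i in range(0, len(groups), groups_per_batch)]

# Two prompts, four rollouts each (labels stand in for token sequences).
groups = [[f"p{p}_r{r}" for r in range(4)] for p in range(2)]

print(flat_minibatches(groups, 3))     # prompt 0's rollouts span 2 batches
print(grouped_minibatches(groups, 1))  # each batch holds one full group
```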