Researchers propose cancellation hypothesis for GRPO in LLM post-training
Researchers proposed the cancellation hypothesis to explain why critic-free reinforcement learning methods such as GRPO work well in large language model post-training. The hypothesis states that sequence-level rewards produce implicit token-level credit assignment because gradient contributions from positive and negative rollouts cancel. Ivan Titov, professor of Natural Language Processing at the University of Edinburgh and the University of Amsterdam, noted that common batching choices in GRPO implementations weaken this cancellation effect: when rollouts for the same prompt are spread across multiple optimizer minibatches, performance drops.
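As an illustration of the mechanism, here is a minimal sketch of GRPO's group-normalized advantages, assuming the standard mean/std normalization; the helper name grpo_advantages and the toy 0/1 rewards are hypothetical, not the paper's code. Because advantages within a group sum to zero, the gradient contribution of any token that appears identically in every rollout cancels, leaving credit concentrated on tokens where the rollouts differ:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: A_i = (r_i - mean(r)) / (std(r) + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, G = 4 rollouts, sequence-level (e.g. 0/1 correctness) rewards.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print(adv)        # ~[ 1. -1.  1. -1.]
print(adv.sum())  # ~0: advantages within a group sum to zero

# For a token whose log-prob gradient g is the same in every rollout (e.g. a
# shared boilerplate token), its net contribution to the policy-gradient
# update is (sum_i A_i) * g, which is ~0. The positive and negative rollouts
# cancel on shared tokens, so only tokens that distinguish good rollouts
# from bad ones receive net credit.
```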
Titov wrote: "Very happy to see this out - great work led by @crazycth0901 and @ZeroyuHuang! One takeaway: batching in GRPO is not a minor detail. In many implementations, rollouts for the same prompt get spread across optimizer minibatches => weaker cancellation effect => lower performance."
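To make the batching point concrete, here is a hypothetical continuation of the sketch above (illustrative only, not taken from any particular GRPO trainer): when all rollouts for a prompt share one optimizer step, the per-step advantage sum is zero, but once the group is split across two minibatches, each step sees a net positive or negative advantage, so the updates on shared tokens no longer cancel within a step.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # Same hypothetical helper as in the previous sketch.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Advantages are computed over the full group of 4 rollouts for one prompt...
adv = grpo_advantages([1.0, 1.0, 0.0, 0.0])   # ~[ 1.  1. -1. -1.]

# ...but the optimizer may not see the whole group in one step.
whole_group = [adv]               # all 4 rollouts in a single minibatch
split       = [adv[:2], adv[2:]]  # rollouts spread across 2 minibatches

for name, steps in [("single minibatch", whole_group),
                    ("split minibatches", split)]:
    print(name, "-> per-step advantage sums:",
          [round(float(s.sum()), 4) for s in steps])
# single minibatch  -> per-step advantage sums: [0.0]
# split minibatches -> per-step advantage sums: [2.0, -2.0]

# With a parameter update between the two steps, the +2 and -2 pushes on
# shared tokens no longer cancel exactly, weakening the implicit
# token-level credit assignment the hypothesis attributes to GRPO.
```

Keeping all rollouts of a group in the same minibatch preserves the zero-sum property at every optimizer step, which is the batching choice the quoted takeaway points to.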