Researchers propose cancellation hypothesis for GRPO in LLM post-training
Researchers proposed the cancellation hypothesis to explain why critic-free reinforcement learning methods such as GRPO work well in large language model post-training. The hypothesis states that sequence-level rewards produce implicit token-level credit assignment because gradient contributions from positive and negative rollouts cancel. Ivan Titov, professor of Natural Language Processing at the University of Edinburgh and the University of Amsterdam, noted that common batching choices in GRPO implementations weaken this cancellation effect: when rollouts for the same prompt are spread across multiple optimizer minibatches, performance drops.
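As an illustration of the mechanism, here is a minimal sketch of GRPO's group-normalized advantages, assuming the standard mean/std normalization; the helper name grpo_advantages and the toy 0/1 rewards are hypothetical, not the paper's code. Because advantages within a group sum to zero, the gradient contribution of any token that appears identically in every rollout cancels, leaving credit concentrated on tokens where the rollouts differ:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: A_i = (r_i - mean(r)) / (std(r) + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, G = 4 rollouts, sequence-level (e.g. 0/1 correctness) rewards.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print(adv)        # ~[ 1. -1.  1. -1.]
print(adv.sum())  # ~0: advantages within a group sum to zero

# For a token whose log-prob gradient g is the same in every rollout (e.g. a
# shared boilerplate token), its net contribution to the policy-gradient
# update is (sum_i A_i) * g, which is ~0. The positive and negative rollouts
# cancel on shared tokens, so only tokens that distinguish good rollouts
# from bad ones receive net credit.
```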
Titov wrote: "Very happy to see this out - great work led by @crazycth0901 and @ZeroyuHuang! One takeaway: batching in GRPO is not a minor detail. In many implementations, rollouts for the same prompt get spread across optimizer minibatches => weaker cancellation effect => lower performance."
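To make the batching point concrete, here is a hypothetical continuation of the sketch above (illustrative only, not taken from any particular GRPO trainer): when all rollouts for a prompt share one optimizer step, the per-step advantage sum is zero, but once the group is split across two minibatches, each step sees a net positive or negative advantage, so the updates on shared tokens no longer cancel within a step.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # Same hypothetical helper as in the previous sketch.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Advantages are computed over the full group of 4 rollouts for one prompt...
adv = grpo_advantages([1.0, 1.0, 0.0, 0.0])   # ~[ 1.  1. -1. -1.]

# ...but the optimizer may not see the whole group in one step.
whole_group = [adv]               # all 4 rollouts in a single minibatch
split       = [adv[:2], adv[2:]]  # rollouts spread across 2 minibatches

for name, steps in [("single minibatch", whole_group),
                    ("split minibatches", split)]:
    print(name, "-> per-step advantage sums:",
          [round(float(s.sum()), 4) for s in steps])
# single minibatch  -> per-step advantage sums: [0.0]
# split minibatches -> per-step advantage sums: [2.0, -2.0]

# With a parameter update between the two steps, the +2 and -2 pushes on
# shared tokens no longer cancel exactly, weakening the implicit
# token-level credit assignment the hypothesis attributes to GRPO.
```

Keeping all rollouts of a group in the same minibatch preserves the zero-sum property at every optimizer step, which is the batching choice the quoted takeaway points to.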