I was not ready for this PPO vs GRPO debate. Here we go again. The truth is just that policy gradient good.
AI2 post-training lead Nathan Lambert argues that the core efficacy of both PPO and GRPO relies on policy gradient methods
Story Overview
In a recent online exchange, AI2 post-training lead Nathan Lambert cuts through the PPO versus GRPO debate by noting that both algorithms ultimately succeed because of their policy gradient foundations, a point he reinforces through his work on the RLHF Book amid growing interest in efficient RL methods for model post-training.
Where to Learn Policy Gradients
Replies to Lambert reveal builders seeking straightforward resources on these methods, underscoring that practical understanding still lags behind the rapid evolution of RL variants like GRPO.
Memory Savings Without Losing Core Strengths
GRPO drops the separate critic model that PPO requires, trimming compute costs while preserving the policy gradient mechanics Lambert identifies as the real driver of results.
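To make the difference concrete, here is a minimal sketch of the two baselines, not either algorithm's reference implementation. It assumes one scalar reward per sampled completion and a group of completions for the same prompt. The function names, the `eps` constant, the use of population standard deviation, and the example `value_estimate` are illustrative assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style baseline: score each completion against its own group.

    `rewards` holds one scalar reward per completion sampled for the same
    prompt. No learned critic is involved; the group mean is the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

def critic_advantages(rewards, value_estimate):
    """PPO-style baseline: subtract a learned value prediction for the prompt.

    `value_estimate` stands in for the output of a separate critic model,
    which is the component GRPO removes.
    """
    return [r - value_estimate for r in rewards]

# Four completions for one prompt, two of them judged correct.
rewards = [1.0, 0.0, 0.0, 1.0]
grpo_adv = group_relative_advantages(rewards)
ppo_adv = critic_advantages(rewards, value_estimate=0.4)  # hypothetical critic output
```

In both cases the resulting advantages weight the same policy gradient update. Only the source of the baseline changes.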
Reactions are split: some users hail the PPO versus GRPO debate as a generational moment, while others dismiss it as stupid, arguing that GRPO is a flawed algorithm whose use is limited to LLM training.
Most Activity
@natolambert Do you educate on this point anywhere because it seems like it would be useful

@hmmmmmstt if you're willing to chat with claude/gpt along the way you'll do fine with this book. If you want to focus only on the book, read sutton & barto first

@natolambert @pleometric collab when

@natolambert People will read an experiment from an ai lab and start announcing the death of stuff

@xlr8harder https://www.youtube.com/shorts/AZLDj5H88B8

@natolambert Do you have it in video short brainrot form?

@natolambert Would it make sense to read RLHF as the first introduction to RL? Or does the book assume some base knowledge of RL for the reader?

@xlr8harder I wrote a book, not sure if that's good enough for people

I did kind of laugh when GRPO came out, but I don’t necessarily agree. The problem is that it is difficult to predict the value function from the prompt alone. GRPO is a hacky way of applying “hindsight.”
Think of the difficulty of the problem as a hidden state. By attempting to solve the problem, you can actually learn about this hidden state. If you are careful, you can actually use this to retroactively predict the hidden state and use it to better predict the value function.
This is related to Rich Sutton’s point that partial observability and function approximation are two sides of the same coin.
GRPO kinda cheats and just estimates the value function directly.
The other thing to keep in mind is that there is essentially 0 environment feedback except for the final reward in most GRPO setups. So lambda < 1 doesn’t help much. So really the only thing that matters is the value of the initial state.
My 2 cents.
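The terminal-reward point in the reply above can be checked with a small sketch of generalized advantage estimation (GAE). It assumes a single reward at the final step and `gamma = lam = 1`. The function name `gae` and the example values are illustrative, not taken from the thread. Under those assumptions every step's advantage reduces to the final reward minus the value predicted at that step. The sketch does not cover the `lambda < 1` case the reply also mentions.

```python
def gae(rewards, values, gamma=1.0, lam=1.0):
    """Generalized advantage estimation over one trajectory.

    `values` has len(rewards) + 1 entries; the last entry is the value of
    the terminal state, which is 0 here.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Discounted sum of TD errors from step t onward.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Only the final step carries a reward, as in most LLM setups described above.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.3, 0.5, 0.2, 0.7, 0.0]  # hypothetical critic predictions
adv = gae(rewards, values)
final_reward = rewards[-1]
expected = [final_reward - v for v in values[:-1]]
```

With these settings the TD errors telescope, so `adv` matches `expected` up to floating-point error.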

@natolambert @xlr8harder /goal claude write a meta review on RLHF

@natolambert All the same with a good pre training

@natolambert These debates are fun because the narrative implies one or the other is the silver bullet. It's engineering tradeoffs all the way down

@natolambert It's a stupid debate. GRPO is a bad algorithm, used because LLM people can't train value functions. Try GRPO in non-LLM RL and it won't be close

@natolambert the powerhouse of tokens

@xlr8harder @natolambert Pleometric was shot and killed in Chicago, Illinois

@natolambert Optimizer discourse gets loud fast. I care more about recipes that survive messy real tasks and boring production constraints.

@jsuarez @natolambert Agreed

@natolambert @xlr8harder 🤣🤣🤣🤣

@Rafa_Schwinger @xlr8harder I'm guessing this would be unbelievably shit, worth a shot tho

I am not saying that LLM people can't train value functions because they are idiots (that is a separate discussion!)... the problem is different. But when you have a problem where you CAN train a value fn, like virtually all of RL outside of LLMs, GRPO is immediately a very bad algorithm compared to virtually anything else you can do