
AI2 post-training lead Nathan Lambert argues that the core efficacy of both PPO and GRPO relies on policy gradient methods

Story Overview

In a recent online exchange, AI2 post-training lead Nathan Lambert cut through the PPO versus GRPO debate by noting that both algorithms ultimately succeed because of their policy gradient foundations. He reinforces the point through his work on the RLHF Book, amid growing interest in efficient RL methods for model post-training.


Original post

Nathan Lambert@natolambert

I was not ready for this PPO vs GRPO debate. Here we go again. The truth is just that policy gradient good.

6:38 AM · Jun 17, 2026 · 11.9K Views

Open Question

Where to Learn Policy Gradients

Replies to Lambert reveal builders seeking straightforward resources on these methods, underscoring that practical understanding still lags behind the rapid evolution of RL variants like GRPO.

Developer Impact

Memory Savings Without Losing Core Strengths

GRPO drops the separate critic (value) model that PPO requires, trimming memory and compute costs while preserving the policy gradient mechanics Lambert identifies as the real driver of results.
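To make the "no critic" point concrete, here is a minimal sketch of how a GRPO-style baseline can be computed from a group of sampled completions for the same prompt, instead of from a learned value model. It assumes the commonly described variant that normalizes each reward by the group mean and standard deviation. The function name `group_relative_advantages` and the `eps` parameter are illustrative choices, not taken from any particular library.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Advantages for one prompt's group of sampled completions.

    Assumption: mean/std normalization within the group. The group mean
    plays the role that a learned critic's value estimate plays in PPO.
    """
    baseline = fmean(rewards)       # stands in for the critic's V(prompt)
    scale = pstdev(rewards) + eps   # eps avoids division by zero
    return [(r - baseline) / scale for r in rewards]


# Four completions for one prompt, scored 1.0 (correct) or 0.0 (wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Each completion's advantage then weights the log-probability gradient of its tokens, the same policy gradient step PPO takes. Note that a group where every completion earns the same reward yields all-zero advantages and therefore no learning signal.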

Sentiment

Positive commenters hail the PPO versus GRPO debate as a generational moment, while negative commenters dismiss it as stupid, arguing that GRPO is a flawed algorithm whose use is limited to LLM training.

Pos

50.0%

Neg

50.0%

2 comments with sentiment.


Posts from X

Most Activity


xlr8harder@xlr8harder

@natolambert Do you educate on this point anywhere because it seems like it would be useful


Nathan Lambert@natolambert

@hmmmmmstt if you're willing to chat with claude/gpt along the way you'll do fine with this book. If you want to focus only on the book, read sutton & barto first

54m5621

xlr8harder@xlr8harder

@natolambert @pleometric collab when

32m743

Mohamed Rashad@MRashadnow

@natolambert People will read an experiment from an ai lab and start announcing the death of stuff

1h2346

Nathan Lambert@natolambert

@xlr8harder https://www.youtube.com/shorts/AZLDj5H88B8

1h1532

xlr8harder@xlr8harder

@natolambert Do you have it in video short brainrot form?

1h1001

utd@hmmmmmstt

@natolambert Would it make sense to read RLHF as the first introduction to RL? Or does the book assume some base knowledge of RL for the reader?

59m87

Nathan Lambert@natolambert

@xlr8harder I wrote a book, not sure if that's good enough for people

1h614

Chris Nota@chris_nota_rl

I did kind of laugh when GRPO came out, but I don’t necessarily agree. The problem is that it is difficult to predict the value function from the prompt alone. GRPO is a hacky way of applying “hindsight.”

Think of the difficulty of the problem as like a hidden state. By attempting to solve the problem, you can actually learn about this hidden state. If you are careful, you can actually use this to retroactively predict the hidden state and use it to better predict the value function.

This is related to Rich Sutton’s point that partial observability and function approximation are two sides of the same coin.

GRPO kinda cheats and just estimates the value function directly.

The other thing to keep in mind is that there is essentially 0 environment feedback except for the final reward in most GRPO setups. So lambda < 1 doesn’t help much. So really the only thing that matters is the value of the initial state.

My 2 cents.

45m18
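One way to read the point about lambda in the post above is through generalized advantage estimation (GAE), the advantage estimator typically paired with PPO. The sketch below is an illustration under stated assumptions, not the poster's own code: the reward arrives only at the final step, the discount `gamma` is 1, and the function name `gae` and the example numbers are made up for the demonstration.

```python
def gae(rewards, values, gamma=1.0, lam=1.0):
    """Generalized advantage estimation for one finished episode.

    rewards[t] and values[t] are per-step; the value after the last
    step is taken to be 0 because the episode has terminated.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then the exponentially weighted sum of them.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


terminal_only = [0.0, 0.0, 0.0, 1.0]   # reward only at the end
flat_baseline = [0.5, 0.5, 0.5, 0.5]   # one value estimate for the prompt

# lam = 1: every step gets (final reward - baseline).
print(gae(terminal_only, flat_baseline, lam=1.0))
# lam < 1: earlier steps see a geometrically shrunken signal.
print(gae(terminal_only, flat_baseline, lam=0.5))
```

With a terminal-only reward and a single baseline for the whole sequence, `lam=1.0` gives every token the same advantage, final reward minus baseline, which is the shape of signal a group-mean baseline also produces. Lowering `lam` mostly attenuates the signal for early tokens rather than adding information.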

Rafa Schwinger 🇻🇦@Rafa_Schwinger

@natolambert @xlr8harder /goal claude write a meta review on RLHF

27m16

Arip@machinestein

@natolambert All the same with a good pre training

1h173

Matt@Matthewagi

@natolambert These debates are fun because the narrative implies one or the other is the silver bullet. It's engineering tradeoffs all the way down

1h166

Joseph Suarez 🐡@jsuarez

@natolambert It's a stupid debate. GRPO is a bad algorithm, used because LLM people can't train value functions. Try GRPO in non-llm RL and it won't be close

56m106

🌌 ͜ʖ🌌@seldon_seen

@natolambert the powerhouse of tokens

32m38

Pleometric@pleometric

@xlr8harder @natolambert Pleometric was shot and killed in Chicago, Illinois

29m91

Eren | AI x Markets@ErenSignals

@natolambert Optimizer discourse gets loud fast. I care more about recipes that survive messy real tasks and boring production constraints.

1h27

Chris Nota@chris_nota_rl

@jsuarez @natolambert Agreed

35m25

Kashif Imteyaz@kashif_imteyaz

@natolambert @xlr8harder 🤣🤣🤣🤣

1h12

Nathan Lambert@natolambert

@Rafa_Schwinger @xlr8harder I'm guessing this would be unbelievably shit, worth a shot tho

21m7

Joseph Suarez 🐡@jsuarez

I am not saying that LLM people can't train a value function because they are idiots (that is a separate discussion!)... the problem is different. But when you have a problem where you CAN train a value fn, like virtually all of RL outside of LLMs, GRPO is immediately a very bad algorithm compared to virtually anything else you can do

42m7