Tech · 1h ago

Prime Intellect's Elie Bakouch says GLM5.2 reintroduces critic models, prompting debate on PPO value estimation trade-offs

Engineer @kalomaze argues the method risks spurious token relationships

Original post

hallerite@hallerite

GLM5.2 brings back the critic.

It was just a matter of time until people realized that group-based variance reduction is infeasible beyond some horizon length. We need to be more fine-grained. I am sure OAI and Ant have been using value models for quite some time.

12:40 PM · Jun 16, 2026 · 8K Views
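
For readers following the terminology: "group-based variance reduction" usually refers to GRPO-style baselines, where every token in a rollout shares one advantage computed against the other rollouts in its group, while a critic (value model) supplies a per-token baseline. The sketch below contrasts the two under simplifying assumptions (a single terminal reward, no discounting). The function names are hypothetical and nothing here is taken from GLM5.2 itself.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style baseline: one scalar advantage per rollout.

    Each rollout's reward is normalized against the mean and standard
    deviation of its group, so every token in a rollout gets the same credit.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def critic_advantages(final_reward, values):
    """Critic-style baseline: one advantage per token.

    values[t] is an assumed critic estimate of the expected final reward
    at token t; the advantage is the undiscounted return minus that estimate.
    """
    return [final_reward - v for v in values]


# Four rollouts for one prompt: two succeed, two fail.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# One successful rollout with three tokens and a critic that grows confident.
print(critic_advantages(1.0, [0.2, 0.5, 0.9]))
```

The first function cannot distinguish an early decisive token from a late irrelevant one, which is the granularity concern the post raises for long horizons.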

Sentiment

Positive users praise GLM5.2 for reintroducing critic models that improve continual training, while negative users dismiss the revival as typical paper hype that fails to change real shipped systems.

Pos 33.4% · Neg 66.6%

5 comments with sentiment.

Posts from X

Views 3.8K · Bookmarks 10 · Likes 53 · Replies 3

Lisan al Gaib@scaling01

yeah there have been rumors for a long time that none of the frontier labs actually use GRPO

hallerite@hallerite

GLM5.2 brings back the critic.

40m · 3.8K views · 53 likes · 10 bookmarks

Retweets 1

kalomaze@kalomaze

in general i think the value estimation question is underspecified without the broader systems engineering context that your particular flavor of policy gradient is embedded in, i.e. it's not as close to being a pure research project as people like to treat it

kalomaze@kalomaze

to be fair, explicit discrimination is a meaningfully different objective. tho i still think more weight is put on the idea of explicit value estimation as a hard necessity rather than what is, at its worst, "trade variance for spurious relationships with this one weird trick!"

1h63893

aria /ɔˈreːliəm/@ariaurelium

finally, a reason to get back to my red-string-corkboard about "PPO literally cannot be doing what the 'critic' metaphor implies it is doing"

"what token is most likely to get me to the right answer at the end" is the policy model's objective!

1h1.8K135

hallerite@hallerite

It's the same principle behind ECHO (SFTing on tool-call outputs), if you understand RL in an information-theoretic way. We simply need to get as much out of every rollout as possible. Naive GRPO is unbelievably wasteful.

2h25513

hallerite@hallerite

well done @Zai_org

2h2027

kalomaze@kalomaze

@siddarthv66 @hallerite i mean sans us permitting the use of branching, yes, some form of discriminative head bolted on top was always gonna get you better returns for long horizons

i just think people spent too much time focused on value estimation in the past in spite of larger *immediate* bottlenecks

2h332

Siddarth@siddarthv66

@hallerite @kalomaze is a hater

2h424

kache@yacineMTB

@kalomaze Do people in the language model space use a separate net for the critic, or is it the same branch for actor + critic?

kalomaze@kalomaze

basically my intuition is that branched group pg variants already exist as *alternative* ways to naturally get pg to behave better in long horizons without explicit value estimation, and afaict we have no strong evidence on whether GLM tried to go this route and failed

53m9030

hallerite@hallerite

@Zai_org On Information Theory

1h802

Ilija Lichkovski@carnot_cyclist

@hallerite nice! group size 1 unlocks a new world, much better suited for continual training over production traces

1h233

kalomaze@kalomaze

@siddarthv66 @hallerite i'm still not a huge fan of absolute regression to estimate value via MSE in the PPO context, and i think pairwise or weighted pairwise formulations for a critic might wind up representationally better

1h292
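
The post above does not specify a loss, so the following is only one possible reading of "pairwise or weighted pairwise formulations": a logistic (Bradley-Terry style) ranking loss over pairs of states whose observed returns differ, optionally weighted by the size of the return gap. It is shown next to plain MSE regression for contrast. All names are hypothetical.

```python
import math


def mse_value_loss(predicted, returns):
    """Absolute regression: penalize the squared gap to each observed return."""
    return sum((p - r) ** 2 for p, r in zip(predicted, returns)) / len(returns)


def pairwise_value_loss(predicted, returns, weighted=False):
    """Ranking loss: only the ordering of predictions matters.

    For every pair (i, j) where return i beat return j, apply
    -log(sigmoid(predicted[i] - predicted[j])). With weighted=True each pair
    is scaled by its return gap (an assumed weighting scheme).
    """
    total, norm = 0.0, 0.0
    for i in range(len(returns)):
        for j in range(len(returns)):
            gap = returns[i] - returns[j]
            if gap <= 0:
                continue  # keep only pairs where i truly outperformed j
            weight = gap if weighted else 1.0
            margin = predicted[i] - predicted[j]
            # -log(sigmoid(m)) equals log(1 + exp(-m))
            total += weight * math.log1p(math.exp(-margin))
            norm += weight
    return total / norm if norm else 0.0


returns = [1.0, 0.0]
# Two critics with the same ordering but very different output scales.
for predicted in ([1.0, 0.0], [10.0, 9.0]):
    print(mse_value_loss(predicted, returns), pairwise_value_loss(predicted, returns))
```

The pairwise loss is identical for both critics because it ignores a constant offset in the predictions, whereas MSE penalizes the offset heavily. Whether that invariance is desirable inside PPO is the open question the thread is debating.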

Siddarth@siddarthv66

@kalomaze @hallerite If segments between value estimates are long enough, the true scale-pilled version is to have reasoning value functions (shared policy weights obviously), trained with RL using neg TD loss as reward. We’re gonna need absurdly long horizons before anything like this is flop optimal

1h272
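
As a rough illustration of "neg TD loss as reward", the sketch below computes one-step temporal-difference errors at segment boundaries and turns each into a reward for whatever model produced the value estimate. Reading "TD loss" as the squared one-step TD error is an assumption, as are the function names and the default of no discounting.

```python
def td_errors(values, rewards, gamma=1.0, terminal_value=0.0):
    """One-step TD error per segment: r_t + gamma * V(s_{t+1}) - V(s_t).

    values and rewards are equal-length lists, one entry per segment.
    """
    next_values = values[1:] + [terminal_value]
    return [r + gamma * nv - v for r, nv, v in zip(rewards, next_values, values)]


def value_estimator_rewards(values, rewards, gamma=1.0):
    """Reward for the estimator at each segment: negative squared TD error."""
    return [-(delta ** 2) for delta in td_errors(values, rewards, gamma)]


# Three segments, reward only at the end. The first estimate is too low.
values = [0.5, 0.5, 1.0]
rewards = [0.0, 0.0, 1.0]
print(td_errors(values, rewards))
print(value_estimator_rewards(values, rewards))
```

An estimator that is self-consistent across segments receives a reward of zero everywhere; any inconsistency is penalized at the segment where it appears.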

hallerite@hallerite

@kalomaze @siddarthv66 you can probably also do without a head bolted on top

1h272

kalomaze@kalomaze

@yacineMTB it's overwhelmingly the shared trunk + lightweight head trained on top thing, i believe

if you asked me to roll a PPO implementation into prime-rl tomorrow, i'd first reach for a trainer-only LoRA critic path

also, by "branched" i meant "branched trajectories with shared prefixes"

45m8110
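
The thread does not spell out how "branched trajectories with shared prefixes" would replace a critic. One plausible reading, sketched here as an assumption, is that several continuations sampled from the same prefix act as a Monte Carlo estimate of that prefix's value, so each branch is scored against its siblings rather than against a learned value head. The names and data layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def branch_advantages(rollouts):
    """Advantage of each branch relative to siblings sharing its prefix.

    rollouts is a list of (prefix_id, reward) pairs. The baseline for a
    prefix is the mean reward of all branches that continue from it.
    """
    by_prefix = defaultdict(list)
    for prefix_id, reward in rollouts:
        by_prefix[prefix_id].append(reward)
    baselines = {p: mean(rs) for p, rs in by_prefix.items()}
    return [reward - baselines[prefix_id] for prefix_id, reward in rollouts]


# Prefix "a" is a coin flip from here on; prefix "b" always succeeds.
rollouts = [("a", 1.0), ("a", 0.0), ("b", 1.0), ("b", 1.0)]
print(branch_advantages(rollouts))
```

Branches from prefix "b" get zero advantage because nothing after the branch point changed the outcome, which localizes credit to the tokens after a branch without any explicit value estimation. The cost is extra sampling at every branch point.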
