Tech · 1h ago

GLM-5.2 reintroduces critic and value models to reinforcement learning, moving away from GRPO for long-horizon tasks

Story Overview

Z.ai updated GLM-5.2's agentic reinforcement learning setup for long-horizon work by restoring a critic-based approach that estimates token-level advantages on individual rollouts, stepping back from the group-relative comparisons used in GRPO.

Original post

hallerite@hallerite

GLM5.2 brings back the critic.

It was just a matter of time until people realized that group-based variance reduction is infeasible beyond some horizon length. We need to be more fine-grained. I am sure OAI and Ant have been using value models for quite some time.

12:40 PM · Jun 16, 2026 · 99.7K Views

Technical Shift

Group comparisons lose steam on variable traces

Long tasks produce compacted sub-traces of uneven lengths, which undercuts GRPO's reliance on comparing outputs within fixed groups and leaves some data unusable.
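
To make the limitation concrete, here is a minimal sketch of the group-relative advantage as GRPO is commonly described: each rollout's reward is normalized against the other rollouts in its group. The function name, the epsilon, and the reward values are illustrative assumptions, not details from Z.ai's setup.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """One scalar advantage per rollout: (reward - group mean) / (group std + eps).

    Every token in a rollout shares that rollout's single advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same prompt with hypothetical terminal rewards.
group = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A sub-trace with no comparable siblings forms a group of one:
# its advantage is exactly zero, so it contributes no learning signal.
singleton = group_relative_advantages([1.0])
```

The singleton case is the point: when sub-traces cannot be matched into a comparable group, the baseline has nothing to compare against and the sample is effectively wasted.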

Open Question

Critic models bring token-level flexibility back

The new setup scores each trajectory on its own, naturally handling length differences and raising open questions about how widely other labs will follow for complex agent workflows.
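
As a rough illustration of scoring a trajectory on its own, the sketch below computes token-level advantages with generalized advantage estimation (GAE), the estimator typically paired with PPO-style critics. The source does not say which estimator GLM-5.2 uses, so GAE, the `gamma` and `lam` defaults, and the sample values are all assumptions.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Token-level advantages for ONE trajectory via GAE.

    rewards[t]: reward received after token t (often zero until the last token).
    values[t]:  critic estimate V(s_t); the value after the final token is taken as 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step TD error at token t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Two trajectories of different lengths, each scored independently.
short = gae_advantages([0.0, 1.0], [0.4, 0.7])
long = gae_advantages([0.0, 0.0, 0.0, 0.0, 1.0], [0.2, 0.3, 0.5, 0.6, 0.9])
```

Nothing here depends on a second rollout: the baseline comes from the critic's per-token value estimates, so trajectories of any length, including a group of size one, yield a distinct advantage for every token.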

Sentiment

Users who responded positively praised GLM-5.2 for reintroducing critic and value models, saying they enable better token-level credit assignment and continual training than GRPO.

Positive: 91.7% · Negative: 8.3% (14 comments with sentiment)

Posts from X

Most Activity

Views: 5.5K · Bookmarks: 15 · Likes: 68 · Retweets: 1

hallerite@hallerite

It's the same principle behind ECHO (SFTing on tool-call outputs), if you understand RL in an information theoretic way. We simply need to get as much out of every rollout as possible. Naive GRPO is unbelievably wasteful.

15h

Replies: 2

Ilija Lichkovski@carnot_cyclist

@hallerite nice! group size 1 unlocks a new world, much better suited for continual training over production traces

14h

hallerite@hallerite

@Zai_org

14h

hallerite@hallerite

well done @Zai_org

15h

Ariel@redtachyon

@hallerite Any clue if they did a value head or a whole separate value model?

11h

hallerite@hallerite

@Zai_org On Information Theory

14h

hallerite@hallerite

@carnot_cyclist 💯

14h

The Nurse Engineer🇳🇬@boochi_dot_dev

I did RL simulations on a recommendation algorithm in one of my projects, where I started out trying GRPO vs. actor-critic. And holy smokes, actor-critic outperformed GRPO by over 15%.

From my observations: GRPO mostly teaches models how to reach an end goal (even when its intermediate steps are clearly wrong or sub-optimal). Actor-critic methods guide the policy to master step-by-step optimal strategies for reaching an end goal (this makes the policy less prone to hallucinations and even lets it learn faster).

13h

hallerite@hallerite

@redtachyon no clue. I think a value head is more likely than a whole separate model, but maybe it's even simpler than that

11h

kalomaze@kalomaze

@siddarthv66 @hallerite i mean sans us permitting the use of branching, yes, some form of discriminative head bolted on top was always gonna get you better returns for long horizons i just think people spent too much time focused on value estimation in the past in spite of larger *immediate* bottlenecks

15h

Shannon Sands@max_paperclips

@hallerite I went back to PPO for games a bit ago. and yeah, I doubt OAI or ANT did GRPO. it's just the long horizon problem

10h

Siddarth@siddarthv66

@hallerite @kalomaze is a hater

15h

kalomaze@kalomaze

@siddarthv66 @hallerite i still think i'm not a huge fan of absolute regression to estimate value via MSE in the PPO context, and how pairwise or weighted pairwise formulations for a critic might wind up representationally better

15h
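
For readers unfamiliar with the distinction drawn in the post above, here is a hedged sketch contrasting an absolute MSE value loss with one possible pairwise alternative, a Bradley-Terry style logistic loss on the score gap. The post does not specify a formulation; the function names and the optional `weight` (which could, for example, be the observed return gap) are assumptions for illustration.

```python
import math


def mse_value_loss(predicted, target_return):
    """Absolute regression: the critic must match the return's scale exactly."""
    return (predicted - target_return) ** 2


def pairwise_value_loss(score_better, score_worse, weight=1.0):
    """-weight * log(sigmoid(score_better - score_worse)).

    Only the ordering and margin between two states matter, not their scale.
    """
    margin = score_better - score_worse
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return weight * (max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))


# Shifting both scores by a constant leaves the pairwise loss unchanged,
# while the MSE loss penalizes any offset from the target return.
same_a = pairwise_value_loss(2.0, 0.0)
same_b = pairwise_value_loss(5.0, 3.0)
```

The design trade-off is that a pairwise critic only ranks states, so turning its scores into advantages for a policy update needs an extra calibration step that absolute regression gets for free.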

hallerite@hallerite

@kalomaze @siddarthv66 you can probably also do without a head bolted on top

14h

Siddarth@siddarthv66

@kalomaze @hallerite If segments between value estimation are long enough, true scale pilled version is to have reasoning value functions (shared policy weights obviously), trained with RL using neg TD loss as reward. We’re gonna need absurdly long horizons before anything like this is flop optimal

14h