Tech · 10h ago

Google DeepMind's Alex Imas says GLM 5.2 uses distillation solely to cold-start reinforcement learning on coding tasks

The model then transitions to pure RL hill-climbing.

Original post

rohan anil @_arohan_

This is indeed how it works, except I don’t know who distills whom, tbh.

The tongue-in-cheek, science-minded part of me wants to ask: if you add DeepSeek sparse attention to your frontier model architecture, are you distilling?

Patrick C Toulme @PatrickToulme

There’s a big misconception about how GLM 5.2 was trained. Yes, they distilled Claude and GPT 5.5 — but distillation is not how they matched Opus quality. Distillation only fixed the cold start problem in RL.

RLing an agentic coding model isn’t rocket science. In simplified terms:

1. RL needs trajectories — rollouts where the model actually completed a task in some env

2. No successful trajectory on a task = zero gradient = you can’t RL it. This is the cold start problem

3. Distillation solves it. You seed your model with knowledge from a smarter one (Claude, GPT) on tasks it can’t do yet

4. Now it produces positive trajectories on those tasks

5. RL on those trajectories and hill climb agentic coding

6. At that point you no longer need to distill and can solely hill climb RL to better models
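A toy sketch of steps 1 to 4 may help make the "zero gradient" point concrete. It assumes a one-parameter policy whose task success probability is `sigmoid(theta)`, a binary reward, and a REINFORCE estimator with a group-mean baseline. The function name `reinforce_gradient`, the values of `theta`, and the idea of modelling distillation as simply starting from a larger `theta` are all illustrative assumptions, not details of how GLM 5.2 was trained.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reinforce_gradient(theta, n_rollouts, rng):
    """REINFORCE estimate for a toy Bernoulli policy (hypothetical setup).

    The policy succeeds with probability sigmoid(theta); reward is 1 on
    success and 0 on failure. The group mean reward is used as a baseline.
    """
    p = sigmoid(theta)
    rewards = [1.0 if rng.random() < p else 0.0 for _ in range(n_rollouts)]
    baseline = sum(rewards) / n_rollouts
    # For this policy, d/dtheta log pi(outcome) = outcome - p.
    grad = sum((r - baseline) * (r - p) for r in rewards) / n_rollouts
    return grad, int(sum(rewards))


rng = random.Random(0)

# Cold start: the task is far too hard, so no rollout succeeds and every
# advantage (reward minus baseline) is zero.
cold_grad, cold_successes = reinforce_gradient(theta=-20.0, n_rollouts=64, rng=rng)

# After seeding (modelled here, as an assumption, by a larger theta), some
# rollouts succeed and the estimate carries a learning signal.
seeded_grad, seeded_successes = reinforce_gradient(theta=0.0, n_rollouts=64, rng=rng)

print("cold start:", cold_successes, "successes, gradient", cold_grad)
print("seeded:    ", seeded_successes, "successes, gradient", seeded_grad)
```

With all rewards equal, the advantages vanish and the update is exactly zero, which is the cold start problem described in step 2. Once the group contains both successes and failures, the estimate becomes positive and pushes `theta` upward.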

This is an interesting curve. I’d argue it’s harder to get to Opus 4.8 from scratch than to go from Opus 4.8 → Fable/Mythos tier.

GLM 5.2 is already producing positive trajectories, so they have plenty to RL on — they’ll keep climbing to Mythos quality without distilling any further. They no longer need American models.

9:00 AM · Jun 24, 2026 · 13.7K Views

Sentiment

Some users view GLM 5.2's distillation method as an effective jumpstart enabling self-sufficient model progress, while others accuse it of theft from competitors and criticize the term as vague or overloaded.

Pos: 42.5%

Neg: 57.5%

14 comments with sentiment.

Posts from X

Most Activity

VIEWS 10.5K · BOOKMARKS 9 · LIKES 96

Rohan @proxy_vector

@PatrickToulme I think this is where people blur bootstrapping with convergence. Distillation can get you into the neighborhood faster, but the question is what actually teaches the model to recover, plan, and stay coherent over long horizons.

1d · 10.5K views · 96 likes · 9 bookmarks

REPLIES 4

Robert W Q Brown @RWQBrown

It’s not a misconception: they stole from Anthropic and OpenAI to create their models. This would be a different conversation if they had distilled another open source model and gotten near-SOTA results. It’s stealing, point blank. If I were Anthropic and OpenAI, I wouldn’t allow my models to be used in China.

1d · 2.1K views

Morgan @morganlinton

@PatrickToulme But GLM 5.2 does not match Opus quality, it benchmarked way below Opus.

It's a great model, and the price is awesome, but it benchmarked slightly below GPT 5.4.

1d · 3.8K views

antirez @antirez

With R0, although the result was not stable, DeepSeek provided evidence that RL can work even from a cold start. I understand that it is hardly optimal, and that having strong reasoning patterns is very useful for getting a better RL signal. But if RL and expert distillation (DeepSeek v4 and GLM 5.2 say they trained domain-specific experts) will do the work, why can't you SFT initially with DeepSeek v3.2 instead of using American models? That could already be enough to get past the cold start.

1d · 3.5K views

Patrick C Toulme @PatrickToulme

@proxy_vector +1 exactly right

1d · 8.6K views

Patrick C Toulme @PatrickToulme

@morganlinton I've seen it closer on other benchmarks. In my own testing, it is in the same tier as Opus. Sure, I would still take Claude over it.

1d · 2.5K views

Kevin S. Xu @kevinsxu

@PatrickToulme Great post, highly informative, thank you.

If distillation does get effectively banned (I know, easier said than done, but executive orders and legislation are all moving toward it), what are other ways to get around the cold start problem of RL?

1d

Janek Mann @janekm

@PatrickToulme @Lunexalith Yes, I would think so too, especially if they’re willing to use some human labour for QA of the environments. (Putting a lot of effort into human labelling is reportedly how Seedance 2 got so good… I wouldn’t be surprised if we saw a similar breakout with LLMs one day)

1d

Patrick C Toulme @PatrickToulme

@janekm @Lunexalith For sure they did. GLM 5.2, IMO, is good enough for them to keep training even if they lost all access to US models, which is very unlikely to begin with.

1d · 4.8K views

Damir Wallener 🇭🇷🇨🇦…🚀🛰️…⚽️🥁…👨‍🍳 @DamirWallener

@PatrickToulme All training is distillation. Human, silicon, doesn’t matter…it’s all distillation.

And that’s how we move forward.

1d · 1.9K views

Jim Liu @jiahanjimliu

@PatrickToulme RL is highly local and often plateaus. There’s a reason why everyone is trying to get more GPUs for a large pre-training run.

1d · 2.2K views

Nathan Lambert @natolambert

@kevinsxu @PatrickToulme If it's not possible, then you need to create even better expert models and distill from there.

1d

Janek Mann @janekm

@PatrickToulme @Lunexalith I wouldn’t be surprised if they also used strong models to help build their RL tasks; they describe an automated pipeline for it. Dario would probably call that distillation too, but I think that really stretches the term 😅 (against Claude ToS, but…)

1d · 5.8K views

Chocobo @chocobo2837

@PatrickToulme We're saying hard distillation here, right? Since the logits aren't returned from these APIs... So hard distillation as mid-training/SFT data, then start your RL?

1d · 1.8K views
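The hard versus soft distinction raised in this reply can be sketched for a single token position. The snippet below is a minimal illustration under stated assumptions: the function names, the four-token vocabulary, and the logit values are hypothetical, and real pipelines would average these losses over whole sequences. Soft distillation matches the teacher's full output distribution and therefore needs its logits; hard distillation only needs the token the teacher actually emitted, which is all a text-only API returns.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student) at one position; requires teacher logits."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hard_distillation_loss(teacher_token, student_logits):
    """Cross-entropy on the teacher's emitted token; requires only its text."""
    q = softmax(student_logits)
    return -math.log(q[teacher_token])


# Hypothetical logits over a four-token vocabulary.
teacher_logits = [2.0, 0.5, -1.0, 0.0]
student_logits = [0.0, 0.0, 0.0, 0.0]  # an untrained, uniform student

soft_loss = soft_distillation_loss(teacher_logits, student_logits)
hard_loss = hard_distillation_loss(teacher_token=0, student_logits=student_logits)
matched_loss = soft_distillation_loss(teacher_logits, teacher_logits)

print("soft loss:", soft_loss)
print("hard loss:", hard_loss)
print("soft loss when student matches teacher:", matched_loss)
```

In the hard setting the teacher's sampled outputs are simply treated as supervised fine-tuning targets, which matches the "mid-training/SFT data, then start your RL" reading in the reply.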

ekello @ekello

@morganlinton @PatrickToulme I tested GLM 5.2 for fiction writing, and there might be better vibes coming from that direction. The frontier models just are not that good at fiction writing anymore.

20h

Compute King @Compute_King

@PatrickToulme Hmm, a fast path to Fable 5 quality by the end of this year?

Why doesn't Mistral do that? They could start from GLM 5.2 and RL on top of it.

1d · 1.8K views

Robert W Q Brown @RWQBrown

@4C4F36 @PatrickToulme That is a blatant lie; there is zero credible evidence to suggest Google did such a thing.

1d

Pan Anon @therealpananon

@PatrickToulme They can RL GLM 5.2 all they want. The base model isn't big enough to ever compete with Mythos, which is currently very undertrained given its size. All Anthropic has to do is keep training and their lead will grow.

1d · 1.7K views

rohan anil @_arohan_

@PandaAshwinee Are you distilling me right now?

8h