
Microsoft Paper Reveals Token Breakdown For Frontier Model Training

Original post
Yann LeCun @ylecun

@natashajaques 😏

Natasha Jaques @natashajaques

Really enjoyed reading the Microsoft MAI-Thinking-1 "Building a Hill Climbing Machine" paper. Amazing they publicly released all the info needed to train a frontier model, down to hparams.

I also thought this was pretty telling:
- pre-training: 30 trillion tokens
- mid-training (SFT on STEM/math/code data): 3.55 trillion tokens
- RL post-training: 150 billion tokens

Looks like @ylecun was right all along with the cake analogy.

Obviously I still think something like RL (optimizing for long term goals) is fundamental to what we think of as intelligence. But it's not the volume of learning signal, it's the optimization on top of an already reasonable predictive model.

4:56 AM · Jun 10, 2026 · 4.4K Views
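
To put the three figures quoted in the post on one scale, here is a minimal sketch that computes each stage's share of the total token count. The numbers are taken as stated in the post and are not checked against the paper; the names `TOKENS` and `token_shares` are made up for this illustration.

```python
# Token counts as quoted in the post (not independently verified).
TOKENS = {
    "pre-training": 30e12,       # 30 trillion
    "mid-training": 3.55e12,     # 3.55 trillion
    "RL post-training": 150e9,   # 150 billion
}


def token_shares(tokens):
    """Return each stage's fraction of the total token count."""
    total = sum(tokens.values())
    return {stage: count / total for stage, count in tokens.items()}


shares = token_shares(TOKENS)
for stage, share in shares.items():
    print(f"{stage:>17}: {share:7.2%}")

# Ratio of pre-training tokens to RL tokens.
pre_to_rl = TOKENS["pre-training"] / TOKENS["RL post-training"]
print(f"pre-training / RL post-training: {pre_to_rl:.0f}x")
```

On these figures, pre-training accounts for roughly 89% of all tokens, mid-training for about 10.5%, and RL post-training for under 0.5%, with pre-training using 200 times as many tokens as RL. Token counts are only a proxy for the volume of learning signal the post is talking about.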
Adel Bucetta @adelbucetta

@natashajaques what i find fascinating is that even with publicly released data, the hparams and decisions made during pre-training and mid-training still feel opaque, like a black box we can't quite peek into

13h · Views 144 · Bookmarks 1
AStar @world_peace_all

@ylecun @natashajaques Cakeology

2h · Views 12
Guddy @gudnessexpert

The token distribution is probably the most important signal in the paper. It reinforces that intelligence may emerge from massive world-model formation first, while RL acts more like a refinement layer that sharpens reasoning toward specific objectives rather than creating it from scratch.

What MAI-Thinking-1 seems to suggest is that prediction remains the foundation, but optimization determines how effectively that knowledge is deployed toward long-term goals.

2h