Microsoft Paper Reveals Token Breakdown For Frontier Model Training

Original post

Yann LeCun@ylecun

@natashajaques 😏

Natasha Jaques@natashajaques

Really enjoyed reading the Microsoft MAI-Thinking-1 "Building a Hill Climbing Machine" paper. Amazing they publicly released all the info needed to train a frontier model, down to hparams.

I also thought this was pretty telling:
- pre-training: 30 trillion tokens
- mid-training (SFT on STEM/math/code data): 3.55 trillion tokens
- RL post-training: 150 billion tokens

Looks like @ylecun was right all along with the cake analogy.

Obviously I still think something like RL (optimizing for long term goals) is fundamental to what we think of as intelligence. But it's not the volume of learning signal, it's the optimization on top of an already reasonable predictive model.

4:56 AM · Jun 10, 2026 · 4.4K Views
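
To put the token counts quoted in the post in proportion, here is a minimal Python sketch that computes each stage's share of the total. The three figures are the ones given in the post; the names `STAGE_TOKENS`, `stage_shares`, and `pretrain_to_rl` are illustrative, and summing the three stages into one total is an assumption made only for comparison.

```python
# Token counts as quoted in the post (integers avoid float rounding in the total).
STAGE_TOKENS = {
    "pre-training": 30_000_000_000_000,    # 30 trillion
    "mid-training": 3_550_000_000_000,     # 3.55 trillion
    "RL post-training": 150_000_000_000,   # 150 billion
}


def stage_shares(tokens):
    """Return each stage's fraction of the summed token count."""
    total = sum(tokens.values())
    return {stage: count / total for stage, count in tokens.items()}


total_tokens = sum(STAGE_TOKENS.values())
shares = stage_shares(STAGE_TOKENS)

for stage, share in shares.items():
    print(f"{stage:>17}: {share:7.2%}")

# How many pre-training tokens there are per RL post-training token.
pretrain_to_rl = STAGE_TOKENS["pre-training"] // STAGE_TOKENS["RL post-training"]
print(f"pre-training / RL ratio: {pretrain_to_rl}x")
```

With these numbers, pre-training is roughly 89% of the summed tokens, mid-training about 10.5%, and RL post-training under 0.5%, which is a 200-to-1 ratio of pre-training to RL tokens.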

Posts from X

Adel Bucetta@adelbucetta

@natashajaques what i find fascinating is that even with publicly released data, the hparams and decisions made during pre-training and mid-training still feel opaque, like a black box we can't quite peek into

AStar@world_peace_all

@ylecun @natashajaques Cakeology

Guddy@gudnessexpert

The token distribution is probably the most important signal in the paper. It reinforces that intelligence may emerge from massive world-model formation first, while RL acts more like a refinement layer that sharpens reasoning toward specific objectives rather than creating it from scratch.

What MAI-Thinking-1 seems to suggest is that prediction remains the foundation, but optimization determines how effectively that knowledge is deployed toward long-term goals.