Moonshot AI's Nathan Chen argues pretraining research remains vital, as many architecture and optimizer improvements each cut the FLOPs needed to reach the same loss by over 20%

Switching to hybrid or sparse attention can reduce long-context FLOPs by over 90%.
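
To give a feel for where a figure like ">90%" could come from, here is a minimal sketch that compares dense attention with a sliding-window variant. It is an illustration under stated assumptions, not the method used in the post: the sliding window is one of many sparse patterns, the function names and the example sizes (128K context, 8K window) are hypothetical, and only the attention-score FLOPs are counted. Projections and MLP layers are ignored, so the whole-model saving is smaller unless the context is long enough for attention to dominate.

```python
def dense_attention_flops(seq_len: int, head_dim: int, num_heads: int) -> int:
    """Approximate FLOPs for dense attention in one layer.

    QK^T costs about 2 * n * n * d FLOPs per head (a multiply-add counted as 2),
    and the weighted sum over V costs the same again, hence the factor of 4.
    Causal masking and softmax cost are ignored in this rough count.
    """
    return 4 * seq_len * seq_len * head_dim * num_heads


def windowed_attention_flops(seq_len: int, window: int, head_dim: int, num_heads: int) -> int:
    """Same count when each query attends to at most `window` keys (assumed pattern)."""
    keys_per_query = min(window, seq_len)
    return 4 * seq_len * keys_per_query * head_dim * num_heads


def attention_flops_saved(seq_len: int, window: int, head_dim: int = 128, num_heads: int = 32) -> float:
    """Fraction of attention FLOPs saved by the windowed variant."""
    dense = dense_attention_flops(seq_len, head_dim, num_heads)
    sparse = windowed_attention_flops(seq_len, window, head_dim, num_heads)
    return 1.0 - sparse / dense


if __name__ == "__main__":
    # Hypothetical sizes: 128K-token context, 8K-token window.
    print(f"{attention_flops_saved(131072, 8192):.2%}")  # 93.75%
```

Under these assumptions the saving reduces to 1 - window / seq_len, so it grows with context length and disappears once the window covers the whole sequence.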

nathan chen@nathancgy4

have been recently thinking about why pretrain research matters among the seemingly more crucial data/compute/rl bottlenecks and sharing my take here on what makes pretrain research (still!) vital:

1. better computational efficiency: scaling law shifts, 2x fewer FLOPs needed to achieve the same loss, etc. plus e.g. long context settings where switching to hybrid or sparse attn can save you >90% FLOPs.

many model arch / optimizer improvements can save you >20% flops needed for the same loss - those are research innovations on every axis from training iter dimension to inter-layer and intra-layer. the effect of compounded architecture advantage is very distinctive given that ur always improving against your sota baseline.

good pretrain research might very well have already delivered you a 10x more efficient (and likewise, better under the same compute) model arch compared to three years ago, and there's still obv many inefficiencies left to be optimized. over half of the compute is still spent on pretraining when you do new from-scratch model trainings rn, and having weeks & months saved there could really allow much more rapid iterations across the entire stack, compounded.
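
The compounding argument in point 1 can be checked with a few lines of arithmetic. This is a sketch under an assumption the post does not state: that each improvement multiplies the remaining FLOPs independently. Real gains often overlap. The function names are hypothetical.

```python
import math


def compounded_speedup(savings: list[float]) -> float:
    """Overall efficiency multiplier when each saving applies to what is left.

    A saving of 0.2 means the same loss is reached with 20% fewer FLOPs.
    Assumes the savings are independent and multiplicative.
    """
    remaining = 1.0
    for saving in savings:
        remaining *= 1.0 - saving
    return 1.0 / remaining


def improvements_needed(target_speedup: float, saving_each: float) -> int:
    """Smallest number of equal, independent savings that reaches the target."""
    return math.ceil(math.log(target_speedup) / -math.log(1.0 - saving_each))


if __name__ == "__main__":
    print(improvements_needed(10.0, 0.2))             # 11
    print(round(compounded_speedup([0.2] * 10), 2))   # 9.31
    print(round(compounded_speedup([0.2] * 11), 2))   # 11.64
```

Under this assumption, about eleven separate 20% savings stacked on the current baseline would be enough for the 10x figure mentioned above.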

2. to train models one couldn't have trained previously: residuals, optimizers, etc. this one's less common since most of the arch innovations don't offer more beyond the expressivity gain. but there are significant ones which can e.g. provide more stable learning dynamics (both theoretically and in practice) at all scales so one could scale up. new model configs or forms of training also come back to better efficiency.

data/compute/FLOPs bottlenecks certainly exist but are relatively more orthogonal to pretrain research and imo it is unclear whether one will be more of a clear intelligence bottleneck a year from now than the other.

in hindsight ive been using "pretrain research" tho this itself is an inefficiency (with further inefficiencies under its scaling law) and "deep learning research" is a better phrasing.

8:45 AM · Jun 8, 2026 · 7.7K Views
Sentiment

Commenters are excited about recent efficiency gains in AI model pretraining; one calls them "wild to see."

Positive: 100.0% · Negative: 0.0% (2 comments with sentiment)
elie@eliebakouch

> good pretrain research might very well have already delivered you a 10x more efficient (and likewise, better under the same compute) model arch compared to three years ago

goes hard, and very true 🫡

1h · Views 2.4K · Likes 41 · Bookmarks 9
elie@eliebakouch

@nathancgy4 totally agree with this 💯

1h · Views 219 · Likes 3
elie@eliebakouch

@nathancgy4 (also realizing i'm an unc using this 💯 emoji quite a lot recently 😭)

1h · Views 88 · Likes 2
Sachin@chsacy

@nathancgy4 @stochasticchasm architecture innovation is primarily about this, along with more local computation.

1h · Views 93
Alexander Long@AlexanderLong

@nathancgy4 So many things in pretraining not understood by anyone. Simple things like is CE even the right thing to optimize for... there's clearly massive improvements lurking there but it's too expensive for almost anyone to find out. need to pool resources somehow.

1h · Views 58
nathan chen@nathancgy4

@eliebakouch no worries this 😭 balances things out

1h · Views 11
Alex YGift@Radipdegen

@eliebakouch stating a reality check that most people just gloss over. actually hitting harder than most price predictions this week

1h · Views 4
Strata@ChainZenit

@eliebakouch the efficiency gains lately are just wild to see.

13m · Views 2