Nathan Chen of Moonshot AI argues pretraining innovations make model architectures up to 10 times more compute efficient

Hybrid or sparse attention can cut long-context FLOPs by more than 90 percent

Original post

elie (@eliebakouch)

> good pretrain research might very well have already delivered you a 10x more efficient (and likewise, better under the same compute) model arch compared to three years ago

goes hard, and very true 🫡

nathan chen (@nathancgy4)

have been recently thinking about why pretrain research matters among the seemingly more crucial data/compute/rl bottlenecks and sharing my take here on what makes pretrain research (still!) vital:

1. better computational efficiency: scaling-law shifts, 2x fewer FLOPs needed to achieve the same loss, etc., plus e.g. long-context settings where switching to hybrid or sparse attn can save you >90% of FLOPs.

many model arch / optimizer improvements can save you >20% of the FLOPs needed for the same loss - those are research innovations on every axis, from the training-iteration dimension to inter-layer and intra-layer. the effect of compounded architecture advantage is very distinctive, given that you're always improving against your sota baseline.

good pretrain research might very well have already delivered you a 10x more efficient (and likewise, better under the same compute) model arch compared to three years ago, and there are still obv many inefficiencies left to be optimized. over half of the compute is still spent on pretraining when you do new from-scratch training runs rn, and having weeks & months saved there could really allow much more rapid iterations across the entire stack, compounded.

2. to train models one couldn't have trained previously: residuals, optimizers, etc. this one's less common, since most arch innovations don't offer anything beyond the expressivity gain. but there are significant ones which can e.g. provide more stable learning dynamics (both theoretically and in practice) at all scales, so one could scale up. new model configs or forms of training also come back to better efficiency.
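A rough back-of-the-envelope sketch of the arithmetic behind point 1, not something taken from the post itself. It assumes that savings compound multiplicatively and that each improvement saves exactly 20% (the post only says ">20%" for many improvements), and it uses a textbook per-layer FLOP estimate for a standard transformer (4x MLP, no grouped-query attention, causal masking ignored). The function names and the example sizes (`d_model=4096`, a 4,096-key attention budget) are hypothetical, chosen only for illustration.

```python
import math


def compounded_speedup(savings):
    """Overall efficiency multiplier if each saving applies to what the previous ones left."""
    remaining = 1.0
    for s in savings:
        remaining *= (1.0 - s)  # fraction of the original FLOPs still needed
    return 1.0 / remaining


def improvements_needed(per_step_saving, target_speedup):
    """How many equal-sized savings must compound to reach the target multiplier."""
    return math.ceil(math.log(1.0 / target_speedup) / math.log(1.0 - per_step_saving))


def layer_flops_per_token(context_len, d_model, attended_keys=None):
    """Approximate forward FLOPs per token for one transformer layer.

    Assumption: 12 * d_model^2 weights per layer (QKV/output projections plus a
    4x MLP) at 2 FLOPs per weight, plus 4 * keys * d_model for the attention
    scores and the weighted sum over values.
    """
    keys = context_len if attended_keys is None else min(attended_keys, context_len)
    dense_part = 24 * d_model ** 2
    attention_part = 4 * keys * d_model
    return dense_part + attention_part


def sparse_saving(context_len, d_model, attended_keys):
    """Fraction of per-layer FLOPs saved when each token attends to a fixed key budget."""
    full = layer_flops_per_token(context_len, d_model)
    sparse = layer_flops_per_token(context_len, d_model, attended_keys)
    return 1.0 - sparse / full


if __name__ == "__main__":
    print(improvements_needed(0.20, 10.0))  # 11 compounding 20% savings pass 10x
    print(compounded_speedup([0.20] * 11))
    print(sparse_saving(1_000_000, 4096, 4096))  # 1M-token context
    print(sparse_saving(131_072, 4096, 4096))    # 128K-token context
```

Under these assumptions, eleven compounding 20% savings are enough to pass 10x, and the attention saving clears 90% at a 1M-token context but not yet at 128K, since the dense projections and MLP still dominate at shorter lengths.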

data/compute/FLOPs bottlenecks certainly exist but are relatively more orthogonal to pretrain research, and imo it is unclear whether one will be a clearer intelligence bottleneck than the other a year from now.

in hindsight i've been using "pretrain research", tho this itself is an inefficiency (with further inefficiencies under its scaling law) and "deep learning research" is a better phrasing.

9:56 AM · Jun 8, 2026 · 5.4K Views
