Tech · 41d ago

Emmy Liu shows that large language models acquire skills in a consistent sequence during pretraining, progressing from copying and morphology to arithmetic and complex reasoning across Pythia, OLMo, and Amber models.

Graham Neubig shared the preprint on cross-family results.

Original post

Lisan al Gaib#1215

Emmy Liu@_emliu

Copying → morphology/translation → basic arithmetic → complex reasoning & math. Across every model family we tested, LLMs acquire skills in roughly the same order during pretraining.

Can we use this to predict what a model will learn next, just from its internals? 🧵

8:14 AM · May 20, 2026 · 35K Views

Sentiment

Users are excited by findings that LLMs acquire skills in a consistent order during pretraining because the pattern enables useful predictions of capabilities and supports research extensions like monitoring emergence.

Pos

95.0%

Neg

5.0%

15 comments with sentiment.

Posts from X

Most Activity

Views 8.6K · Bookmarks 49 · Likes 59 · Retweets 11 · Replies 2

Graham Neubig@gneubig

Check out our new work on examining what LLMs learn and when!

We posit that LLMs have an implicit curriculum where they learn gradually more complex skills, and attempt to uncover some details of how this curriculum develops over time across model families.

41d ago · 8.6K views · 59 likes · 49 bookmarks

Kaiser Sun@KaiserWhoLearns

Can you run before knowing how to walk? We find that skills emerge in a consistent order during LLM training, and the skills that compose them will emerge earlier 📈 #NLProc #LLM

41d ago · 2.3K views

Emmy Liu@_emliu

Thanks to my collaborators @KaiserWhoLearns, @millicent_li, @wordscompute, @lltjuatja, @JentseHuang, and @gneubig! Excited about follow-ups to this as well, keep an eye out! 👀
Paper: https://arxiv.org/abs/2604.08510
Code: https://github.com/KaiserWhoLearns/ElementalTask

41d ago

Emmy Liu@_emliu

One implication: training monitoring. If emergence ordering is stable, we can potentially flag whether a model is developing capabilities ahead of or behind schedule, before reaching the end of the run. We hope to improve the FV-based prediction and forecasting.
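
The monitoring idea above could be sketched roughly as follows. This is only an illustration, not the authors' method: the function `schedule_status`, the reference schedule, the token counts, and the 25% tolerance band are all hypothetical, and it assumes emergence points are compared in tokens seen rather than raw step counts.

```python
from typing import Dict, Optional

def schedule_status(
    reference: Dict[str, float],           # hypothetical: expected emergence point per task (billions of tokens)
    observed: Dict[str, Optional[float]],  # emergence points seen so far in the current run
    tokens_seen: float,                    # how far the current run has progressed
    tolerance: float = 0.25,               # hypothetical +/-25% band around the reference
) -> Dict[str, str]:
    """Label each task as ahead, behind, on_schedule, or pending."""
    status = {}
    for task, ref in reference.items():
        early, late = ref * (1 - tolerance), ref * (1 + tolerance)
        seen = observed.get(task)
        if seen is not None:
            if seen < early:
                status[task] = "ahead"
            elif seen > late:
                status[task] = "behind"
            else:
                status[task] = "on_schedule"
        elif tokens_seen > late:
            # The window has closed and the skill still has not emerged.
            status[task] = "behind"
        else:
            status[task] = "pending"
    return status

# Toy numbers, not results from the paper.
reference = {"copy": 10.0, "plural": 40.0, "add": 100.0, "word_problem": 300.0}
observed = {"copy": 9.0, "plural": 20.0}
report = schedule_status(reference, observed, tokens_seen=150.0)
print(report)
```

With these toy inputs, `plural` is flagged as ahead and `add` as behind, while `word_problem` stays pending because its window has not closed yet.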

41d ago

Emmy Liu@_emliu

Finding 1: emergence order is strikingly consistent. Mean Spearman r = .81 across all 45 model pairs, including cross-family. Composites usually emerge after their components. (but only under absolute thresholds)
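
For readers unfamiliar with the statistic, here is a minimal sketch of a mean pairwise Spearman correlation over emergence orders. The model names, task names, and step values are made up for illustration, and the helper functions are not taken from the ElementalTask code; the paper's exact procedure (for example, how ties or never-acquired tasks are handled) may differ.

```python
from itertools import combinations
from typing import Dict, List, Sequence

def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    # Assumes neither input is constant (otherwise the denominator is zero).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def mean_pairwise_spearman(emergence: Dict[str, Dict[str, int]]) -> float:
    """Average Spearman over all model pairs, using tasks both models acquired."""
    rhos = []
    for a, b in combinations(sorted(emergence), 2):
        shared = sorted(set(emergence[a]) & set(emergence[b]))
        rhos.append(spearman([emergence[a][t] for t in shared],
                             [emergence[b][t] for t in shared]))
    return sum(rhos) / len(rhos)

# Toy emergence steps per model (hypothetical values).
emergence = {
    "model_a": {"copy": 1000, "plural": 2000, "add": 4000, "word_problem": 8000},
    "model_b": {"copy": 500, "plural": 3000, "add": 6000, "word_problem": 12000},
    "model_c": {"copy": 1000, "plural": 4000, "add": 2000, "word_problem": 8000},
}
mean_rho = mean_pairwise_spearman(emergence)
print(round(mean_rho, 3))  # 0.867
```

Because only ranks matter, models trained for very different numbers of steps can still be compared: `model_a` and `model_b` above correlate perfectly even though their absolute emergence steps differ.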

41d ago

Emmy Liu@_emliu

We hypothesize that the order of acquisition may mostly, but not perfectly, represent intuitive skill decompositions. Of course, this may not always be the case because the manner in which LMs learn things is sometimes unintuitive, but it can be a good practical guide.

41d ago

Emmy Liu@_emliu

Finding 3: this structure is encoded in representations. Tasks with similar function vectors follow similar learning trajectories. Using prediction in FV space, we can predict the training trajectory of held-out composite tasks to some extent without evaluating (R² = .68–.84)
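
To make the idea of predicting in function-vector (FV) space concrete, here is a deliberately simple stand-in: a held-out task's accuracy curve is estimated as a cosine-similarity-weighted average of the curves of known tasks, then scored with R². This is an assumption for illustration only, not the predictor used in the paper; the 2-D vectors, curves, and function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def predict_trajectory(
    target_fv: Sequence[float],
    known_fvs: Dict[str, Sequence[float]],
    known_curves: Dict[str, Sequence[float]],
) -> List[float]:
    """Similarity-weighted average of known curves; negative similarities are clipped to 0."""
    weights = {t: max(cosine(target_fv, fv), 0.0) for t, fv in known_fvs.items()}
    total = sum(weights.values())  # assumes at least one task has positive similarity
    n = len(next(iter(known_curves.values())))
    return [sum(weights[t] * known_curves[t][i] for t in known_fvs) / total
            for i in range(n)]

def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Toy function vectors and accuracy curves over three checkpoints.
known_fvs = {"copy": [1.0, 0.0], "add": [0.0, 1.0]}
known_curves = {"copy": [0.2, 0.8, 1.0], "add": [0.0, 0.2, 0.8]}
target_fv = [1.0, 1.0]          # equally similar to both known tasks
actual = [0.0, 0.5, 1.0]        # held-out curve, used only for scoring

predicted = predict_trajectory(target_fv, known_fvs, known_curves)
score = r_squared(actual, predicted)
print(round(score, 2))  # 0.96
```

The point of the sketch is only the shape of the pipeline: the held-out task is never evaluated to produce `predicted`; its curve is used afterwards to compute the score.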

41d ago

Emmy Liu@_emliu

Finding 2: composite tasks emerge after their prerequisite elementals: we find that this is true in 54/76 cases, with only 3 strong inversions where the composite task was acquired before either component (all three involving the first_letter subskill).
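
One way to read the counting behind this finding is sketched below. The category names and the tie rule (a composite that emerges at the same checkpoint as its last component counts as consistent) are assumptions, and the task names and steps are invented; the paper's own bookkeeping may differ.

```python
from collections import Counter
from typing import Dict, Tuple

def classify(composite_step: int, component_steps: Tuple[int, ...]) -> str:
    """Compare a composite task's emergence step with those of its components."""
    if composite_step >= max(component_steps):
        return "consistent"          # no earlier than every component
    if composite_step < min(component_steps):
        return "strong_inversion"    # earlier than every component
    return "partial_inversion"       # after some components, before others

# Hypothetical emergence steps for elemental tasks.
elemental = {"capitalize": 1000, "plural": 2000, "translate": 4000, "first_letter": 8000}

# Hypothetical composites: (emergence step, component names).
composites: Dict[str, Tuple[int, Tuple[str, ...]]] = {
    "plural_then_capitalize": (4000, ("plural", "capitalize")),
    "first_letter_of_translation": (2000, ("first_letter", "translate")),
    "capitalize_first_letter": (4000, ("capitalize", "first_letter")),
}

labels = {
    name: classify(step, tuple(elemental[c] for c in parts))
    for name, (step, parts) in composites.items()
}
counts = Counter(labels.values())
print(counts["consistent"], counts["strong_inversion"], counts["partial_inversion"])  # 1 1 1
```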

41d ago

Emmy Liu@_emliu

Our hypothesis is that pretraining follows a compositional, predictable curriculum that's consistent across model families and data mixtures. We also hypothesize that the training curves of skills can be predicted from model internals, i.e., similar computation ~ similar learning

41d ago

Emmy Liu@_emliu

We designed 91 simple and compositional tasks (string ops, morphology, translation, logic, arithmetic, reading comprehension...) and tracked emergence across 9 models from 4 families (410M–13B) on a variety of data mixes.
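
As a rough illustration of what tracking emergence across checkpoints can mean in practice, the sketch below marks a task as emerged at the first checkpoint where accuracy crosses an absolute threshold (the thread mentions absolute thresholds, but the 0.5 value here is an assumption). Task names, steps, and accuracy values are hypothetical, not data from the paper.

```python
from typing import Dict, List, Optional, Sequence

def emergence_step(
    steps: Sequence[int],
    accuracy: Sequence[float],
    threshold: float = 0.5,   # hypothetical absolute threshold
) -> Optional[int]:
    """Return the first checkpoint step where accuracy reaches the threshold, else None."""
    for step, acc in zip(steps, accuracy):
        if acc >= threshold:
            return step
    return None

# Toy per-checkpoint accuracy curves.
steps = [1000, 2000, 4000, 8000, 16000]
curves: Dict[str, List[float]] = {
    "copy": [0.2, 0.7, 0.9, 0.95, 0.97],
    "plural": [0.0, 0.1, 0.6, 0.8, 0.9],
    "add_two_digits": [0.0, 0.0, 0.1, 0.55, 0.8],
    "word_problem": [0.0, 0.0, 0.0, 0.1, 0.3],
}

emerged = {task: emergence_step(steps, acc) for task, acc in curves.items()}

# Acquisition order over the tasks that emerged within the run.
order = sorted((t for t, s in emerged.items() if s is not None), key=lambda t: emerged[t])
print(order)  # ['copy', 'plural', 'add_two_digits']
```

Tasks that never cross the threshold (here `word_problem`) get `None` and are left out of the ordering rather than assigned an arbitrary late rank.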

41d ago

Grok@grok

Great fork—ElementalTask's checkpoint-resolved FVs + RML's manifold geometry lens is a natural fit for monitoring emergence rank, drift, and forecasting.

DeepSeek's take is crisp and actionable. Testing dilution (FV signal? constraint strength?) as a predictor for when composites actually emerge could tighten the "ahead/behind schedule" signal nicely.

Your cross-model heatmap already shows impressive stability. What's the first dilution experiment?

40d ago

Uriel Dolev@UrielDolev

@_emliu @scaling01 Nice work! Would be interesting to run a pretraining exp with some synthetic compositional task that has no data of the subskills needed to solve it and see if LLMs acquire these subskills or learn it in some other way

40d ago

one who tends a crystal rabbit 🐍@a_cuniculturist

@_emliu Suggests an interesting potential approach for pre-pretraining.

40d ago

Emmy Liu@_emliu

@UrielDolev @scaling01 yeah, agree that this would be interesting, I think bc of simplicity bias for most tasks it would certainly learn them starting with subskills and then composing complex ones. We are going to explore training based followups so excited to see if this is the case in math/code etc

40d ago

Henry Dowling@henrytdowling

@_emliu this was a great read! dumb q but is an implication here that curriculum learning matters less than we would have thought since some of that "ordering" is being done by the training process itself?

40d ago

Naomi Saphra@nsaphra

@_emliu Cool work! I’m a little confused about use of “emergence” here, though. Usually I associate it with discontinuity, but here the visible discrete events are “task saturation” rather than “task breakthrough”. What do the results look like when you look at saturation?

40d ago

Uriel Dolev@UrielDolev

@_emliu @scaling01 Cool, it would also be interesting to see what’s harder: learning the subskills or learning to compose them. This could lead to conclusions regarding what’s most important for pretraining (probably fundamental subskills) and what can easily be learned in post-training

40d ago

dan with glasses@dan_hawkley

@_emliu @KaiserWhoLearns @millicent_li @wordscompute @lltjuatja @JentseHuang @gneubig Forked to explore the monitoring implication:

checkpoint → emergence rank → FV geometry → constraint drift → forecasting

Notebooks:
- emergence ordering
- function-vector drift
- ahead/behind schedule
- cross-model stability

highly stable

repo: https://github.com/thinkthoughts/ElementalTask-RML 📐

40d ago

dan with glasses@dan_hawkley

@_emliu @KaiserWhoLearns @millicent_li @wordscompute @lltjuatja @JentseHuang @gneubig @grok thoughts? This quick, one-line take from Deepseek:

40d ago

dan with glasses@dan_hawkley

@KaiserWhoLearns 🛹 🚦I forked your ElementalTask repo (thinkthoughts/ElementalTask-RML) ... and posted a follow-up paper re: arXiv:2604.08510, here: http://labreports.app/ElementalTask_RML.pdf

40d ago