ICML paper finds internal data repetition during LLM pretraining can waste up to 33% of compute resources

Jessica Chudnovsky ✈️ ICML 2026@jchudnov

Paper: https://arxiv.org/abs/2606.24998.

With @JoshuaK92829 (co-lead), Noam Levi (co-lead), @RylanSchaeffer, @yegordb, Bo He, Mehmet Donmez, @sanmikoyejo, and David Donoho.

Jessica Chudnovsky ✈️ ICML 2026@jchudnov

@JoshuaK92829 @RylanSchaeffer @yegordb @sanmikoyejo With @stai_research @StanfordAILab @stanfordnlp

Jessica Chudnovsky ✈️ ICML 2026@jchudnov

Flying to #ICML2026 to present Internal Data Repetition Destroys Language Models, an Oral at Foundations of Deep Gen Models Workshop!

Paper: https://arxiv.org/abs/2606.24998

You might be curious to know what we mean by “destroys”! Pretraining is now data-constrained, and even aggressively deduplicated corpora keep some repetition. We measured what that repetition actually costs in the currency practitioners care about: compute. The answer, in the worst case, is a third of your FLOPs.

Jessica Chudnovsky ✈️ ICML 2026@jchudnov

Deduplication is standard practice but imperfect, and the harm from the residual repetition has been hard to measure. We apply a compute-equivalent measure to assess repetition damage: the number of FLOPs a repetition-free run would need to reach the same loss.
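
To make the compute-equivalent idea concrete, here is a minimal sketch. It assumes the repetition-free baseline follows a fitted power law L(C) = E + A * C^(-alpha); that functional form, the parameter names `E`, `A`, `alpha`, and both function names are illustrative assumptions, not the paper's fitting procedure.

```python
def equivalent_compute(loss, E, A, alpha):
    """FLOPs a repetition-free run would need to reach `loss`.

    Assumes (hypothetically) a baseline curve L(C) = E + A * C**(-alpha),
    fitted beforehand on runs without repeated data.
    """
    if loss <= E:
        raise ValueError("loss is at or below the assumed irreducible floor E")
    # Invert L(C) = E + A * C**(-alpha) for C.
    return (A / (loss - E)) ** (1.0 / alpha)


def wasted_fraction(loss_with_repeats, compute_spent, E, A, alpha):
    """Share of the spent compute that bought nothing relative to a clean run."""
    c_eq = equivalent_compute(loss_with_repeats, E, A, alpha)
    return 1.0 - c_eq / compute_spent


# Toy numbers only: a run that spent 100 units of compute but landed at the
# loss a clean run would reach with 64 units.
toy_waste = wasted_fraction(2.25, 100.0, E=1.0, A=10.0, alpha=0.5)
```

With these toy parameters the clean curve gives a loss of 2.0 at 100 units and 2.25 at 64 units, so the repeated run is scored as having wasted 36% of its compute even though its loss is only 0.25 higher.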

Jessica Chudnovsky ✈️ ICML 2026@jchudnov

The location of the damage peak is also predictable: it follows a clean trend in model size, with larger models taking their worst hit from fewer repeats of larger pools. As models scale up, repetition structures that used to be harmless move into the harmful regime; our hypothesis is that memorization capacity grows faster than compute.

Jessica Chudnovsky ✈️ ICML 2026@jchudnov

In our setup, we always let repeats eat the same 10% of training tokens, and only changed their shape. Is it worse to repeat a lot of documents a few times, or a few documents a lot of times? The arrangement matters enormously...
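
A rough sketch of what "same 10%, different shape" could look like in practice. The function name, the dictionary keys, and the convention that `repeats` counts the total number of times each pooled token is seen are assumptions for illustration; the thread does not spell out the exact construction.

```python
def repetition_shapes(total_tokens, repeat_counts, repeated_fraction=0.10):
    """Enumerate (pool size, repeat count) pairs that fill the same token budget.

    The repeated budget is held fixed at `repeated_fraction` of training
    tokens; only how it is split between pool size and repeat count changes.
    """
    budget = round(total_tokens * repeated_fraction)
    shapes = []
    for k in repeat_counts:
        pool = budget // k          # distinct tokens in the repeated pool
        repeated = pool * k         # tokens actually spent on the pool
        shapes.append({
            "repeats": k,
            "pool_tokens": pool,
            "repeated_tokens": repeated,
            "unique_tokens": total_tokens - repeated,
        })
    return shapes


shapes = repetition_shapes(1_000_000, [1, 10, 100])
```

Each entry spends the same 100,000 tokens on the repeated pool: a 100,000-token pool seen once, a 10,000-token pool seen 10 times, or a 1,000-token pool seen 100 times.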

Jessica Chudnovsky ✈️ ICML 2026@jchudnov

Namely, the damage consistently peaks at an intermediate repeat count. A small pool repeated very many times gets memorized without much collateral harm, and a large pool repeated a few times behaves almost like unique data. The harmful regime lies in between, where the repeated set is big enough to influence what the model learns and frequent enough to skew it toward the copies. This intermediate peak was first observed by Hernandez et al. in 2022, and we confirm it survives under compute-optimal budgets.

Jessica Chudnovsky ✈️ ICML 2026@jchudnov

In the worst case, the harm is severe: repetition wastes 33% of your compute. HUGE!!! Reading loss curves alone would understate the cost of repetition, since a small gap in loss can correspond to a large gap in compute.

Jessica Chudnovsky ✈️ ICML 2026@jchudnov

This pattern seems fundamental and has little dependence on architecture. We show that even the simplest setting of misspecified linear regression with duplicated rows produces the same intermediate peak, which we derive in closed form. When the number of repeats is small, the duplicated rows carry little weight relative to the unique data, and the model generalizes well. When the number of repeats is large, the model fully memorizes the small repeated pool, confining the damage to the few directions it spans while the unique data fix the rest. The peak sits in between, where the repeated and unique data carry comparable weight and the model partially fits a systematic error it can neither average out nor isolate.
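
For readers who want to poke at the linear-regression picture, here is a toy harness, not the paper's model or its closed form. It assumes Gaussian features, uses per-row label noise as a stand-in for misspecification (so every copy of a row repeats the same error), and fits ordinary least squares; the function name and all parameters are hypothetical, and whether a sweep shows the intermediate peak depends on choices the thread does not specify.

```python
import numpy as np


def duplicated_regression_error(n_unique, pool_size, repeats, dim,
                                noise=0.5, seed=0):
    """Excess test risk of least squares fit on unique rows plus a repeated pool."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=dim) / np.sqrt(dim)

    def sample(n):
        X = rng.normal(size=(n, dim))
        # Assumption: per-row error stands in for misspecification.
        y = X @ w_true + noise * rng.normal(size=n)
        return X, y

    X_u, y_u = sample(n_unique)
    X_p, y_p = sample(pool_size)

    # Stack the unique rows with `repeats` exact copies of the pool.
    X = np.vstack([X_u, np.tile(X_p, (repeats, 1))])
    y = np.concatenate([y_u, np.tile(y_p, repeats)])

    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    # For isotropic Gaussian inputs, excess risk equals ||w_hat - w_true||^2.
    return float(np.sum((w_hat - w_true) ** 2))


# Sweep shapes at a fixed repeated budget of 200 rows (pool_size * repeats).
sweep = {k: duplicated_regression_error(1800, 200 // k, k, dim=50)
         for k in (1, 2, 5, 10, 20, 50, 100)}
```

Keeping `pool_size * repeats` constant mirrors the fixed-fraction design described in the thread, so any differences across the sweep come from the shape of the repetition alone.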

Jessica Chudnovsky ✈️ ICML 2026@jchudnov

I'm especially grateful to @JoshuaK92829, whose vision and mentorship made this project possible; working on it was an amazing experience, and I learned so much from him and the team.

Jessica Chudnovsky ✈️ ICML 2026@jchudnov

As pretraining becomes more data-constrained, repetition becomes harder to avoid. And its cost isn't captured by the fraction of data that's duplicated: at the same fraction, the repeat structure alone can consume a meaningful share of a run's compute. Our findings add precision to the study of duplication in language models, quantifying the wasted compute incurred by both the presence and the repeat structure of duplicates.

Jessica Chudnovsky ✈️ ICML 2026@jchudnov

The practical implication is that the fraction of a corpus that is duplicated is an incomplete measure of risk. Two corpora with identical duplication rates can waste very different amounts of compute depending on how the duplicates are concentrated.
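
A small sketch of why the duplication rate alone hides the structure. It assumes exact duplicates can be identified by a document ID or hash; the function name and the two toy corpora are made up for illustration.

```python
from collections import Counter


def duplication_profile(doc_ids):
    """Return (fraction of documents that belong to a repeated group,
    histogram mapping repeat count -> number of distinct documents)."""
    counts = Counter(doc_ids)
    total = len(doc_ids)
    in_repeated_groups = sum(c for c in counts.values() if c > 1)
    histogram = Counter(c for c in counts.values() if c > 1)
    return in_repeated_groups / total, dict(histogram)


# Corpus A: one document seen 10 times, plus 90 unique documents.
corpus_a = ["dup"] * 10 + [f"a{i}" for i in range(90)]
# Corpus B: five documents seen twice each, plus 90 unique documents.
corpus_b = [f"dup{i}" for i in range(5)] * 2 + [f"b{i}" for i in range(90)]

rate_a, shape_a = duplication_profile(corpus_a)
rate_b, shape_b = duplication_profile(corpus_b)
```

Both toy corpora report the same 10% duplication rate, but the histograms differ (`{10: 1}` versus `{2: 5}`), which is the kind of concentration difference the post says a single rate cannot capture.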

Diya Sabharwal@diyasabh

@jchudnov Super interesting Jessica. Would love to chat more about this when you’re back
