Data repetition is known to be harmful for LLM pretraining. @jchudnov's paper shows the harms depend on a scaling-predictable interaction btwn model parameters, number of repeated docs, and number of repeats.
The wrong combination eviscerates compute - as much as 33% wasted!!!
Flying to #ICML2026 to present "Internal Data Repetition Destroys Language Models", an Oral at the Foundations of Deep Gen Models Workshop!
Paper: https://arxiv.org/abs/2606.24998
You might be curious to know what we mean by “destroys”! Pretraining is now data-constrained, and even aggressively deduplicated corpora keep some repetition. We measured what that repetition actually costs in the currency practitioners care about: compute. The answer, in the worst case, is a third of your FLOPs.

