Pre-training is increasingly data-constrained: compute outruns text, models repeat tokens many times, and how much repetition you can afford is an open question. In "Mix, Don't Tune" (my @Apple MLR internship), we run ~1000 pre-training runs from 150M to 1.43B params with full HP grids at every scale, to figure out what actually drives performance when target-language data is scarce, and land on a concrete recipe for the data-constrained regime. (1/3)
Paper: https://arxiv.org/abs/2605.13225