
Plenty of open questions left about how to choose mixtures for a given scale, transfer to real pretraining, post-training as a separate axis, etc., but this paper lays solid foundations for thinking about all of these. Link: https://arxiv.org/abs/2605.29548