Ultra-FineWeb Paper Outlines Advanced Data Filtering and Annealing for LLMs
The hard part about pretraining data is that the signal you want can only be measured at relatively large scales. So you have to use tricks like annealing/midtraining or continued pretraining (CPT) to measure relative quality at a smaller budget.
The Ultra-FineWeb paper is a pretty exceptional manual for thinking about tiers of data, how to do quality filtering, and rephrasing. I especially like that they adapted the annealing technique my team developed at Databricks to evaluate data. It sparks joy.
Now the problem is that with either annealing or CPT, you take on a set of priors you have to account for.
CPT tends to be way cheaper, but it raises more questions: how do you warm the model back up, which replay data do you choose, which model do you start from, how do you sweep the LR, and so on.
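To make a couple of those CPT knobs concrete, here is a minimal sketch of the two most mechanical ones: re-warming the learning rate and mixing in replay data. Everything here is an assumption for illustration, not something taken from the Ultra-FineWeb paper: the function names `cpt_lr` and `pick_source`, the linear-warmup-then-cosine shape, and the idea of sampling replay data with a fixed probability are all hypothetical choices you would sweep in practice.

```python
import math
import random


def cpt_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Hypothetical CPT schedule: linear re-warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp up from a small LR so the pretrained weights are not shocked.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def pick_source(rng, replay_fraction):
    """Choose which stream the next batch comes from.

    'replay' is data resembling the original pretraining mix;
    'candidate' is the dataset whose quality is being measured.
    """
    return "replay" if rng.random() < replay_fraction else "candidate"


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 50, 100, 500, 1000):
        lr = cpt_lr(step, total_steps=1000, warmup_steps=100, peak_lr=1e-4)
        print(step, f"{lr:.2e}", pick_source(rng, replay_fraction=0.3))
```

Each of `peak_lr`, `warmup_steps`, and `replay_fraction` is one of the priors mentioned above: change any of them and the relative ranking of two candidate datasets can shift, which is why they usually need a sweep rather than a single guess.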