Ultra-FineWeb Paper Outlines Advanced Data Filtering and Annealing for LLMs

Original post

The Ultra-FineWeb paper is a pretty exceptional manual for thinking about tiers of data, quality filtering, and rephrasing. I especially like that they adapted the annealing technique my team developed at Databricks to evaluate their data. It sparks joy.

10:51 AM · May 30, 2026

Cody Blakeney @code_star

The hard part about pretraining data is that the signal you want can only be measured at relatively large scales. So you have to use tricks (annealing/midtraining or continued pretraining, CPT) to measure relative quality.

5:51 PM · May 30, 2026 · 2.9K Views

Cody Blakeney @code_star

Now the problem is that with either annealing or CPT, you take on a set of priors you have to account for.

CPT tends to be way cheaper, but has more problems (how do you warm the model back up, how do you choose the replay data, which model do you use, LR sweeps, etc.).

5:51 PM · May 30, 2026 · 285 Views
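
To make the CPT knobs above concrete, here is a minimal sketch of two of them: re-warming the learning rate and mixing in replay data. Everything in it is an assumption for illustration, not something taken from the thread or the Ultra-FineWeb paper: the linear re-warm followed by cosine decay, the fixed `replay_fraction`, the function names, and all numeric values are hypothetical. In practice `peak_lr` is the kind of value an LR sweep would cover.

```python
import math
import random


def rewarm_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Hypothetical CPT schedule: linear re-warm to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so the checkpoint is not hit with the full LR at once.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def sample_source(replay_fraction, rng):
    """Choose the stream for the next batch: replayed original data or the candidate data."""
    return "replay" if rng.random() < replay_fraction else "candidate"


def plan_cpt_run(total_steps=1000, warmup_steps=50, peak_lr=1e-4,
                 replay_fraction=0.3, seed=0):
    """Return a (learning_rate, data_source) pair for every step of a short CPT run."""
    rng = random.Random(seed)
    return [
        (rewarm_cosine_lr(s, total_steps, warmup_steps, peak_lr),
         sample_source(replay_fraction, rng))
        for s in range(total_steps)
    ]


if __name__ == "__main__":
    plan = plan_cpt_run()
    for step in (0, 49, 50, 500, 999):
        lr, source = plan[step]
        print(f"step {step:4d}  lr={lr:.2e}  source={source}")
```

Comparing two candidate datasets would then mean running the same plan from the same checkpoint with only the candidate stream swapped, so the schedule and the replay mix are held fixed as shared priors.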
