
Ultra-FineWeb Paper Outlines Advanced Data Filtering and Annealing for LLMs

Original post

The Ultra-FineWeb paper is a pretty exceptional manual for thinking about tiers of data, quality filtering, and rephrasing. I especially like that they adapted the annealing technique my team developed at Databricks to evaluate data. It sparks joy.

10:51 AM · May 30, 2026

The hard part about pretraining data is that the signal you want can only be measured at relatively large scales. So you have to use tricks like annealing/midtraining or continued pretraining (CPT) to measure relative quality.
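A minimal sketch of the annealing idea, assuming a setup where every candidate dataset is mixed into the final stretch of training from one shared checkpoint while the learning rate decays to zero, and candidates are then ranked by their gain over a baseline run. All names here (`linear_anneal_lr`, `sample_source`, `rank_candidates`) are hypothetical, as are the linear decay shape and the mixing scheme; none of this is taken from the Ultra-FineWeb paper or the Databricks work.

```python
import random


def linear_anneal_lr(step: int, total_steps: int, start_lr: float) -> float:
    """Learning rate decayed linearly from start_lr to 0 over the annealing run."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return start_lr * (1.0 - step / total_steps)


def sample_source(candidate_fraction: float, rng: random.Random) -> str:
    """Pick which data source the next example comes from (assumed mixing scheme)."""
    return "candidate" if rng.random() < candidate_fraction else "baseline"


def rank_candidates(scores: dict[str, float], baseline_score: float) -> list[tuple[str, float]]:
    """Rank candidate datasets by eval gain over the baseline annealing run."""
    deltas = {name: score - baseline_score for name, score in scores.items()}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)
```

The point of the sketch is that only the relative ordering from `rank_candidates` is used. The absolute scores depend on the checkpoint and schedule you chose.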

Cody Blakeney (@code_star)


Now the problem is that with either annealing or CPT, you take on a set of priors that you have to account for.

CPT tends to be way cheaper, but it raises more questions: how do you warm the model back up, which replay data do you choose, which model do you use, how do you sweep the LR, etc.
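To make those CPT knobs concrete, here is a rough sketch assuming a linear re-warmup followed by cosine decay, a fixed replay fraction per batch, and a plain grid sweep over peak LR and replay fraction. The function names (`cpt_lr`, `mixture_counts`, `sweep_configs`) and every default are hypothetical choices for illustration, not a recommended recipe.

```python
import math


def cpt_lr(step: int, warmup_steps: int, total_steps: int,
           peak_lr: float, min_lr: float = 0.0) -> float:
    """Re-warm the LR linearly to peak_lr, then cosine-decay it to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def mixture_counts(batch_size: int, replay_fraction: float) -> dict[str, int]:
    """Split a batch between replayed pretraining data and the new data under test."""
    n_replay = round(batch_size * replay_fraction)
    return {"replay": n_replay, "new": batch_size - n_replay}


def sweep_configs(peak_lrs: list[float], replay_fractions: list[float]) -> list[dict[str, float]]:
    """Enumerate every (peak LR, replay fraction) pair for a grid sweep."""
    return [
        {"peak_lr": lr, "replay_fraction": rf}
        for lr in peak_lrs
        for rf in replay_fractions
    ]
```

Each of these choices is a prior baked into the measurement, which is why the grid from `sweep_configs` grows quickly once you also vary the base model.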
