Stanford paper 'A Bitter Lesson for Data Filtering' presents scaling studies showing that large language models tolerate, and sometimes benefit from, unfiltered low-quality data in high-compute pretraining regimes.
Using no filter at all outperforms conventional filtering methods at sufficient compute scale.
——0——
@Guodzh +1 on duplication, as cross-entropy loss is extremely sensitive to duplicated data.
Interesting results, I've debated this with ppl many times. Duplication (multi-epoching or duplicated data) hurts pretraining a lot more than some bad data does.
4:56 AM · May 23, 2026 · 4.8K Views
6:06 AM · May 23, 2026 · 98 Views