11h ago

Stanford paper 'A Bitter Lesson for Data Filtering' presents scaling studies showing large language models tolerate and sometimes benefit from unfiltered low-quality data in high-compute pretraining regimes

At sufficient compute scales, training on unfiltered data outperforms conventional filtering methods.


@sytelus This makes a ton more sense. I was struggling with reconciling this result with what we see in bioML models.

12:36 PM · May 22, 2026

@Guodzh +1 on duplication, as cross entropy loss is extremely sensitive to duplicated data.

Guodong Zhang @Guodzh

Interesting results, was debating with ppl on this many times. Duplication (multi-epoching or duplicated data) hurts training a lot more than some bad data for pretraining

4:56 AM · May 23, 2026 · 4.8K Views
6:06 AM · May 23, 2026 · 98 Views