11h ago

Stanford paper 'A Bitter Lesson for Data Filtering' presents scaling studies showing large language models tolerate and sometimes benefit from unfiltered low-quality data in high-compute pretraining regimes

Applying no filter outperforms conventional filtering methods at sufficient compute scales.

Original post

#1675 Anshul Kundaje @ANSHULKUNDAJE

@sytelus This makes a ton more sense. I was struggling with reconciling this result with what we see in bioML models.

12:36 PM · May 22, 2026

#1158 Banghua Zhu @BANGHUAZ

@Guodzh +1 on duplication, as cross entropy loss is extremely sensitive to duplicated data.

Guodong Zhang @Guodzh

Interesting results, was debating with ppl on this many times. Duplication (multi-epoching or duplicated data) hurts training a lot more than some bad data for pretraining

4:56 AM · May 23, 2026 · 4.8K Views

6:06 AM · May 23, 2026 · 98 Views
