1h ago

Stanford researchers release paper 'A Bitter Lesson for Data Filtering', showing large models improve without data filtering in high-compute, data-scarce regimes

The work targets pretraining scenarios with abundant compute and limited data.

Original post

Some new results I found surprising that I’m tweeting for Chris (who isn't on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.

8:51 AM · May 21, 2026

https://arxiv.org/abs/2605.19407 We down-sample the Common Crawl pool and apply filters on top, simulating a smaller data universe. With enough compute, training on the pool catches up to every filter in DCLM, even when we eval on PPL for a higher-quality, filtered corpus.

Tatsunori Hashimoto (@tatsu_hashimoto)

3:51 PM · May 21, 2026 · 12.5K Views
3:51 PM · May 21, 2026 · 1K Views
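
The setup described in the thread (down-sample a pool, apply a filter on top, then train at a fixed compute budget) can be illustrated with a toy sketch. This is not the paper's code: `downsample_pool`, `apply_filter`, `epochs_needed`, and the length-based score are hypothetical stand-ins, and whitespace-split words stand in for tokens. It only shows why a filtered subset must be repeated for more epochs than the unfiltered pool at the same token budget.

```python
import random


def downsample_pool(pool, fraction, seed=0):
    """Keep a random `fraction` of the documents to simulate a smaller data universe."""
    rng = random.Random(seed)
    k = int(len(pool) * fraction)
    return rng.sample(pool, k)


def apply_filter(docs, score_fn, keep_fraction):
    """Keep the top `keep_fraction` of documents by a quality score.

    `score_fn` is a hypothetical placeholder for a real quality classifier.
    """
    ranked = sorted(docs, key=score_fn, reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]


def epochs_needed(token_budget, docs):
    """Number of passes over `docs` required to consume `token_budget` tokens."""
    total_tokens = sum(len(d.split()) for d in docs)
    return token_budget / total_tokens


if __name__ == "__main__":
    # Toy pool: 1000 documents with 1 to 7 words each.
    pool = [("word " * (i % 7 + 1)).strip() for i in range(1000)]
    small_pool = downsample_pool(pool, fraction=0.1)
    filtered = apply_filter(small_pool, score_fn=len, keep_fraction=0.1)

    budget = 10_000  # fixed training budget, in toy "tokens"
    print("epochs on unfiltered pool:", epochs_needed(budget, small_pool))
    print("epochs on filtered subset:", epochs_needed(budget, filtered))
```

Because the filtered subset is always contained in the down-sampled pool, it can never need fewer epochs at the same budget; whether the extra repetition costs more than the filtering gains is the empirical question the thread is about.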

This is the case even for more targeted “bad” data, like shuffling the tokens or even injecting completely random token sequences. We find that adding in very large amounts of fresh “bad” data doesn't hurt (and even helps, for shuffled data), compared to more epoching.

3:51 PM · May 21, 2026 · 828 Views
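
The two kinds of deliberately "bad" data mentioned above, shuffled tokens and completely random token sequences, can be sketched as follows. This is an illustration under assumptions, not the authors' procedure: the function names, the `mode` flag, and the choice to match random sequence lengths to the fresh documents are all hypothetical, and tokens are plain integer ids.

```python
import random


def shuffle_tokens(tokens, rng):
    """Return a copy of `tokens` in random order (same tokens, ordering destroyed)."""
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled


def random_tokens(length, vocab_size, rng):
    """Return `length` token ids drawn uniformly from [0, vocab_size)."""
    return [rng.randrange(vocab_size) for _ in range(length)]


def add_fresh_bad_data(clean_seqs, fresh_seqs, vocab_size, mode, rng):
    """Extend the training set with corrupted versions of unseen sequences.

    `fresh_seqs` are sequences not already in `clean_seqs` (an assumption of
    this sketch). In "shuffle" mode their tokens are permuted; in "random"
    mode they are replaced by uniform random ids of the same length.
    """
    bad = []
    for seq in fresh_seqs:
        if mode == "shuffle":
            bad.append(shuffle_tokens(seq, rng))
        elif mode == "random":
            bad.append(random_tokens(len(seq), vocab_size, rng))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return list(clean_seqs) + bad


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [[1, 2, 3, 4], [5, 6, 7]]
    fresh = [[8, 9, 10, 11, 12], [13, 14]]
    print(add_fresh_bad_data(clean, fresh, vocab_size=100, mode="shuffle", rng=rng))
    print(add_fresh_bad_data(clean, fresh, vocab_size=100, mode="random", rng=rng))
```

The comparison in the thread is between training on a set extended this way and spending the same compute on additional epochs over the clean sequences alone.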

@tatsu_hashimoto @ChengleiSi you lost me at "with enough compute"

4:30 PM · May 21, 2026 · 263 Views

Things are weird in the (severely) data-constrained regime. Tatsu is always thinking far ahead about the future!

4:29 PM · May 21, 2026 · 976 Views