Stanford researchers release paper 'A Bitter Lesson for Data Filtering' showing large models improve without data filtering in high-compute data-scarce regimes

QUOTE POST

@tatsu_hashimoto @giffmana TheRightWay™ strikes again?

5:12 PM · May 21, 2026 · 173 Views

REPLY

https://arxiv.org/abs/2605.19407 We down-sample the Common Crawl pool and apply filters on top, simulating a smaller data universe. With enough compute, training on the pool catches up to every filter in DCLM, even when we eval on PPL for a higher-quality, filtered corpus.

Tatsunori Hashimoto@tatsu_hashimoto

Some new results I found surprising that I’m tweeting for Chris (who isn't on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.

3:51 PM · May 21, 2026 · 12.5K Views

3:51 PM · May 21, 2026 · 1K Views
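
The setup described in the reply above (down-sample the pool, then filter on top of the smaller pool) can be sketched roughly as follows. This is a toy illustration, not the paper's code: `downsample_pool`, `apply_filter`, the 10% sampling fraction, the 20% keep rate, and the use of document length as a stand-in quality score are all assumptions made for the example.

```python
import random


def downsample_pool(pool, fraction, seed=0):
    """Keep a random `fraction` of the pool to simulate a smaller data universe."""
    rng = random.Random(seed)
    k = int(len(pool) * fraction)
    return rng.sample(pool, k)


def apply_filter(docs, score_fn, keep_rate):
    """Keep the top `keep_rate` share of documents, ranked by a quality score."""
    ranked = sorted(docs, key=score_fn, reverse=True)
    return ranked[: int(len(ranked) * keep_rate)]


# Toy pool of 1000 documents; length stands in for a real quality classifier (assumption).
pool = [f"doc {i} " + "word " * (i % 7) for i in range(1000)]

small_universe = downsample_pool(pool, fraction=0.1)  # the unfiltered baseline
filtered = apply_filter(small_universe, score_fn=len, keep_rate=0.2)  # filter applied on top
```

The comparison in the thread is then between training on `small_universe` as-is and training on `filtered`, at matched compute.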

REPLY

Tatsunori Hashimoto@tatsu_hashimoto

This is the case even for more targeted “bad” data, like shuffling the tokens or even injecting completely random token sequences. We find that adding in very large amounts of fresh “bad” data doesn't hurt (and even helps, for shuffled data), compared to more epoching.

Tatsunori Hashimoto@tatsu_hashimoto

https://arxiv.org/abs/2605.19407 We down-sample the Common Crawl pool and apply filters on top, simulating a smaller data universe. With enough compute, training on the pool catches up to every filter in DCLM, even when we eval on PPL for a higher-quality, filtered corpus.

3:51 PM · May 21, 2026 · 1K Views

3:51 PM · May 21, 2026 · 828 Views
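
For readers unfamiliar with the two kinds of "bad" data mentioned in the reply above, here is a minimal sketch of how shuffled-token and random-token sequences could be generated. It is an illustration under assumptions, not the authors' procedure: the helper names, the toy token IDs, and the 50,000-entry vocabulary size are hypothetical.

```python
import random


def shuffle_tokens(tokens, rng):
    """Return a copy of `tokens` in random order (same tokens, order destroyed)."""
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled


def random_sequence(length, vocab_size, rng):
    """Return `length` token IDs drawn uniformly from the vocabulary."""
    return [rng.randrange(vocab_size) for _ in range(length)]


rng = random.Random(0)
clean = [5, 17, 42, 8, 99, 3]  # toy token IDs (assumption)

shuffled = shuffle_tokens(clean, rng)  # keeps the unigram content of the document
noise = random_sequence(len(clean), vocab_size=50_000, rng=rng)  # carries no document content
```

Shuffling preserves which tokens appear in a document, which may be one reason it could still carry some signal, while uniformly random sequences carry none.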

REPLY

Xiao Ma@INFOXIAO

@tatsu_hashimoto @ChengleiSi you lost me at "with enough compute"

Tatsunori Hashimoto@tatsu_hashimoto

Some new results I found surprising that I’m tweeting for Chris (who isn't on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.

3:51 PM · May 21, 2026 · 12.5K Views

4:30 PM · May 21, 2026 · 263 Views

QUOTE POST

CLS@ChengleiSi

Things are weird in the (severely) data-constrained regime. Tatsu is always thinking far ahead about the future!

Tatsunori Hashimoto@tatsu_hashimoto

Some new results I found surprising that I’m tweeting for Chris (who isn't on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.

3:51 PM · May 21, 2026 · 12.5K Views

4:29 PM · May 21, 2026 · 976 Views
