Stanford researchers release paper 'A Bitter Lesson for Data Filtering', reporting that in high-compute, data-scarce regimes large models trained on unfiltered data can catch up to filtered baselines

The work targets pretraining scenarios where compute is abundant and data is limited.

Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.

8:51 AM · May 21, 2026

https://arxiv.org/abs/2605.19407 We down-sample the common crawl pool and apply filters on top, simulating a smaller data universe. With enough compute, training on the pool catches up to every filter in DCLM, even when we eval on PPL for a higher-quality, filtered corpus.

Tatsunori Hashimoto (@tatsu_hashimoto)

Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.

3:51 PM · May 21, 2026 · 16.2K Views
3:51 PM · May 21, 2026 · 1.2K Views
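
The setup described in the post (down-sample the pool, then apply a filter on top) can be sketched roughly as below. This is only an illustrative toy, not the paper's pipeline: the function names, the `score_fn` quality score, and the whitespace token count are all assumptions made for the example. It shows why filtering a small data universe forces more epochs for the same token budget.

```python
import random

def downsample_pool(pool, fraction, seed=0):
    """Keep a random `fraction` of the pool to simulate a smaller data universe."""
    rng = random.Random(seed)
    k = int(len(pool) * fraction)
    return rng.sample(pool, k)

def apply_filter(docs, score_fn, keep_fraction):
    """Keep the top `keep_fraction` of docs under a (hypothetical) quality score."""
    ranked = sorted(docs, key=score_fn, reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]

def epochs_needed(token_budget, docs):
    """Passes over `docs` needed to consume `token_budget` tokens.

    Tokens are counted by whitespace splitting, a stand-in for a real tokenizer.
    """
    total_tokens = sum(len(d.split()) for d in docs)
    return token_budget / total_tokens
```

For a fixed token budget, `epochs_needed` is always larger on the filtered subset than on the down-sampled pool it was drawn from, which is the trade-off the thread is about: fewer, "cleaner" documents repeated more often versus more, noisier documents seen fewer times.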

This is the case even for more targeted “bad” data, like shuffling the tokens or even injecting completely random token sequences. We find that adding in very large amounts of fresh “bad” data doesn't hurt (and even helps, for shuffled data), compared to more epoching.

3:51 PM · May 21, 2026 · 1K Views
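
As a rough illustration of the two kinds of "bad" data mentioned in the post, the sketch below builds shuffled-token and random-token sequences and mixes them into a clean set. Everything here is an assumption for illustration (function names, integer token ids, the `fresh_seqs` pool used as the source of new data); the post does not specify how the paper constructs these sequences.

```python
import random

def shuffle_tokens(tokens, rng):
    """Return a copy of `tokens` in random order (same tokens, order destroyed)."""
    out = list(tokens)
    rng.shuffle(out)
    return out

def random_sequence(length, vocab_size, rng):
    """Return `length` token ids drawn uniformly from [0, vocab_size)."""
    return [rng.randrange(vocab_size) for _ in range(length)]

def mix_with_bad_data(clean_seqs, fresh_seqs, vocab_size, mode, seed=0):
    """Append one 'bad' sequence per fresh sequence to the clean set.

    mode="shuffle": shuffle the tokens of each fresh sequence.
    mode="random":  replace each fresh sequence with random ids of the same length.
    """
    rng = random.Random(seed)
    if mode == "shuffle":
        bad = [shuffle_tokens(s, rng) for s in fresh_seqs]
    elif mode == "random":
        bad = [random_sequence(len(s), vocab_size, rng) for s in fresh_seqs]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return list(clean_seqs) + bad
```

The comparison in the post is between training on a mix like this and spending the same compute on additional epochs over the clean data alone.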

@sainingxie @giffmana This set of results would argue even more strongly for things like NoFilter, in that in sufficiently high compute regimes for LMs, you may actually get benefits on the majority group (the 'high quality filtered set' we eval on) by including the minority group.

Saining Xie (@sainingxie)

@tatsu_hashimoto @giffmana TheRightWay™ strikes again?

5:12 PM · May 21, 2026 · 419 Views
5:48 PM · May 21, 2026 · 10 Views

@tatsu_hashimoto @ChengleiSi you lost me at "with enough compute"

4:30 PM · May 21, 2026 · 334 Views

Things are weird in the (severely) data-constrained regime. Tatsu is always thinking far ahead about the future!

4:29 PM · May 21, 2026 · 1.1K Views

Multi-epoch pre-training should be the default setting for pre-training papers

5:40 PM · May 21, 2026 · 145 Views

@tatsu_hashimoto Very cool. I am now expecting a flood of models on AI x bio that try to do the same thing (TBF they are already largely doing this with little success), without realizing at what scale & problem definitions this actually works. At least, I know who to blame.😆

5:44 PM · May 21, 2026 · 32 Views

@tatsu_hashimoto More seriously, I would love to understand whether this claim holds for bioAI models and applications like DNALMs & single cell FMs that have enough training data but really struggle to learn effectively.

Anshul Kundaje (@anshulkundaje)

@tatsu_hashimoto Very cool. I am now expecting a flood of models on AI x bio that try to do the same thing (TBF they are already largely doing this with little success), without realizing at what scale & problem definitions this actually works. At least, I know who to blame.😆

5:44 PM · May 21, 2026 · 32 Views
5:47 PM · May 21, 2026 · 9 Views