Stanford researchers release paper 'A Bitter Lesson for Data Filtering' showing large models improve without data filtering in high-compute, data-scarce regimes

REPLY

#55Lucas Beyer (bl16)@GIFFMANA

@sainingxie @tatsu_hashimoto 🤗

Saining Xie@sainingxie

@tatsu_hashimoto @giffmana TheRightWay™ strikes again?

5:12 PM · May 21, 2026 · 1.8K Views

6:49 PM · May 21, 2026 · 44 Views

REPLY

#55Lucas Beyer (bl16)@GIFFMANA

Yep! I think your paper is a bit of a mix (spiritually) of our NoFilter and our veeeery boringly-titled (and hence mostly unknown) multitask study paper (https://arxiv.org/pdf/2303.17376), it has this experiment which is the equivalent of your experiment with the epochs and added shuffled/noisy/unrelated data. We should have put a cross at the epoch boundary too, that's a great idea.

I think the underlying effect is "regularization" more than "transfer", though admittedly both these terms are kinda vague to begin with. And it's very cool to see the same effects happen in language as in vision, confirms my prior but now i can point to your paper, so thanks for your work :)

Tatsunori Hashimoto@tatsu_hashimoto

@sainingxie @giffmana This set of results would argue even more strongly for things like NoFilter, in that in sufficiently high compute regimes for LMs, you may actually get benefits on the majority group (the 'high quality filtered set' we eval on) by including the minority group.

5:48 PM · May 21, 2026 · 468 Views

7:00 PM · May 21, 2026 · 40 Views

QUOTE POST

#55Lucas Beyer (bl16)@GIFFMANA

@tatsu_hashimoto @sainingxie I mean this experiment of yours. Sorry if it was a bit unclear, I'm writing on the phone, which i hate.

Tatsunori Hashimoto@tatsu_hashimoto

This is the case even for more targeted “bad” data, like shuffling the tokens or even injecting completely random token sequences. We find that adding in very large amounts of fresh “bad” data doesn't hurt (and even helps, for shuffled data), compared to more epoching.

3:51 PM · May 21, 2026 · 1.9K Views

7:01 PM · May 21, 2026 · 40 Views

QUOTE POST

#158Saining Xie@SAININGXIE

@tatsu_hashimoto @giffmana TheRightWay™ strikes again?

5:12 PM · May 21, 2026 · 1.8K Views

REPLY

#231Tatsunori Hashimoto@TATSU_HASHIMOTO

https://arxiv.org/abs/2605.19407 We down-sample the common crawl pool and apply filters on top, simulating a smaller data universe. With enough compute, training on the pool catches up to every filter in DCLM, even when we eval on PPL for a higher-quality, filtered corpus.

Tatsunori Hashimoto@tatsu_hashimoto

Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.

3:51 PM · May 21, 2026 · 37.5K Views

3:51 PM · May 21, 2026 · 2.2K Views

REPLY

#231Tatsunori Hashimoto@TATSU_HASHIMOTO

This is the case even for more targeted “bad” data, like shuffling the tokens or even injecting completely random token sequences. We find that adding in very large amounts of fresh “bad” data doesn't hurt (and even helps, for shuffled data), compared to more epoching.

Tatsunori Hashimoto@tatsu_hashimoto

https://arxiv.org/abs/2605.19407 We down-sample the common crawl pool and apply filters on top, simulating a smaller data universe. With enough compute, training on the pool catches up to every filter in DCLM, even when we eval on PPL for a higher-quality, filtered corpus.

3:51 PM · May 21, 2026 · 2.2K Views

3:51 PM · May 21, 2026 · 1.9K Views
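
The two kinds of targeted "bad" data mentioned in the post above (shuffled tokens and completely random token sequences) could be constructed roughly as follows. This is only a minimal sketch, not the paper's code: the function names, the toy vocabulary size, and the choice to shuffle within each sequence are all assumptions made for illustration.

```python
import random


def shuffle_tokens(tokens, rng):
    """Return a copy of `tokens` in random order (same tokens, order destroyed)."""
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled


def random_tokens(length, vocab_size, rng):
    """Return `length` token ids drawn uniformly from [0, vocab_size)."""
    return [rng.randrange(vocab_size) for _ in range(length)]


def make_bad_docs(source_docs, n_docs, vocab_size, mode, seed=0):
    """Build `n_docs` corrupted token sequences (hypothetical helper).

    mode="shuffled": permute the tokens of each source document.
    mode="random":   ignore the content and emit uniform random ids
                     with the same length as the source document.
    """
    rng = random.Random(seed)
    bad_docs = []
    for i in range(n_docs):
        src = source_docs[i % len(source_docs)]
        if mode == "shuffled":
            bad_docs.append(shuffle_tokens(src, rng))
        elif mode == "random":
            bad_docs.append(random_tokens(len(src), vocab_size, rng))
        else:
            raise ValueError(f"unknown mode: {mode!r}")
    return bad_docs


# Toy token-id sequences standing in for tokenized documents (assumed data).
toy_docs = [[1, 2, 3, 4], [5, 6, 7]]
shuffled_docs = make_bad_docs(toy_docs, n_docs=3, vocab_size=10, mode="shuffled")
random_docs = make_bad_docs(toy_docs, n_docs=3, vocab_size=10, mode="random")
```

In a comparison like the one described, such sequences would be added as fresh data alongside the clean set, with the alternative being to spend the same budget on additional epochs over the clean set.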

REPLY

#231Tatsunori Hashimoto@TATSU_HASHIMOTO

@sainingxie @giffmana This set of results would argue even more strongly for things like NoFilter, in that in sufficiently high compute regimes for LMs, you may actually get benefits on the majority group (the 'high quality filtered set' we eval on) by including the minority group.

Saining Xie@sainingxie

@tatsu_hashimoto @giffmana TheRightWay™ strikes again?

5:12 PM · May 21, 2026 · 1.8K Views

5:48 PM · May 21, 2026 · 468 Views

REPLY

#231Tatsunori Hashimoto@TATSU_HASHIMOTO

@anshulkundaje I think part of this is an argument along the lines of "even low-quality data has *some* structure, and that is better than using more weight decay". In practice, you'd rather spend your time doing data augmentation and so on first, and even then, there is a hard limit..

Anshul Kundaje@anshulkundaje

@tatsu_hashimoto More seriously, I would love to understand whether this claim holds for bioAI models and applications like DNALMs & single cell FMs that have enough training data but really struggle to learn effectively.

5:47 PM · May 21, 2026 · 143 Views

5:51 PM · May 21, 2026 · 405 Views

REPLY

#231Tatsunori Hashimoto@TATSU_HASHIMOTO

@PandaAshwinee @ChengleiSi I think the regime where this is true is very far out on the compute scales. It's after you've exhausted ensembling / synth data / etc. Even in the "naive" case where you don't do this, we don't expect to see these effects for several orders of magnitude more compute, not 27B.

Ashwinee Panda@PandaAshwinee

@tatsu_hashimoto @ChengleiSi not sure i agree - we’re going to post results soon showing that 7B “doesn’t benefit as much” from filtering (on DCLM) vs 1B, yes, but i wouldn’t extrapolate that trend out to expect “no improvement” at 27B

5:51 PM · May 21, 2026 · 398 Views

5:55 PM · May 21, 2026 · 349 Views

REPLY

#437Xiao Ma@INFOXIAO

@tatsu_hashimoto @ChengleiSi you lost me at "with enough compute"

Tatsunori Hashimoto@tatsu_hashimoto

Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.

3:51 PM · May 21, 2026 · 37.5K Views

4:30 PM · May 21, 2026 · 611 Views

QUOTE POST

#442CLS@CHENGLEISI

Things are weird in the (severely) data-constrained regime. Tatsu is always thinking far ahead about the future!

Tatsunori Hashimoto@tatsu_hashimoto

Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.

3:51 PM · May 21, 2026 · 37.5K Views

4:29 PM · May 21, 2026 · 2K Views

REPLY

#1365Ashwinee Panda@PANDAASHWINEE

@tatsu_hashimoto @ChengleiSi not sure i agree - we’re going to post results soon showing that 7B “doesn’t benefit as much” from filtering (on DCLM) vs 1B, yes, but i wouldn’t extrapolate that trend out to expect “no improvement” at 27B

Tatsunori Hashimoto@tatsu_hashimoto

Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.

3:51 PM · May 21, 2026 · 37.5K Views

5:51 PM · May 21, 2026 · 398 Views

QUOTE POST

#1381Zihan "Zenus" Wang@WZENUS

I love the work a lot, but most of the time people are still under budget, and recently more so in post-training like RL.

When each rollout is noisy and takes a lot of money and time, filtering good ones cleverly can be much better than scaling up (which we cover in RAGEN-2).

Tatsunori Hashimoto@tatsu_hashimoto

Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.

3:51 PM · May 21, 2026 · 37.5K Views

7:13 PM · May 21, 2026 · 370 Views

QUOTE POST

#1460Jiaxin Wen@JIAXINWEN22

Multi-epoch pre-training should be the default setting for pre-training papers

Tatsunori Hashimoto@tatsu_hashimoto

Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.

3:51 PM · May 21, 2026 · 37.5K Views

5:40 PM · May 21, 2026 · 3.5K Views

REPLY

#1675Anshul Kundaje@ANSHULKUNDAJE

@tatsu_hashimoto Very cool. I am now expecting a flood of models on AI x bio that try to do the same thing (TBF they are already largely doing this with little success), without realizing at what scale & problem definitions this actually works. At least, I know who to blame.😆

Tatsunori Hashimoto@tatsu_hashimoto

Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.

3:51 PM · May 21, 2026 · 37.5K Views

5:44 PM · May 21, 2026 · 439 Views

REPLY

#1675Anshul Kundaje@ANSHULKUNDAJE

@tatsu_hashimoto More seriously, I would love to understand whether this claim holds for bioAI models and applications like DNALMs & single cell FMs that have enough training data but really struggle to learn effectively.

Anshul Kundaje@anshulkundaje

@tatsu_hashimoto Very cool. I am now expecting a flood of models on AI x bio that try to do the same thing (TBF they are already largely doing this with little success), without realizing at what scale & problem definitions this actually works. At least, I know who to blame.😆

5:44 PM · May 21, 2026 · 439 Views

5:47 PM · May 21, 2026 · 143 Views
