
Stanford researchers release paper 'A Bitter Lesson for Data Filtering', showing that large models improve without data filtering in high-compute, data-scarce regimes

The work targets pretraining scenarios where compute is abundant and data is limited.

Original post

Tatsunori Hashimoto (@tatsu_hashimoto) · #206 in Tech

Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.

8:51 AM · May 21, 2026 · 133.1K Views

Sentiment

Positive users find the Stanford study on large models benefiting from unfiltered data fascinating since it suggests compute outperforms filtering efforts, whereas negative users dismiss it as a scaling trap that conceals bad data.

Positive: 34.7%

Negative: 65.3%

15 comments with sentiment.

Digg Deeper

Kevin Rose asked

Explain this in plain English for coders that are using LLMs to code

When you train or fine-tune an LLM with plenty of compute but only a limited supply of high-quality code data, aggressive filtering of noisy, low-quality, or even shuffled snippets may not be worth it. Large models tolerate the extra volume, and can sometimes improve from it, because the added data acts as a built-in regularizer that curbs overfitting to a small curated set. For coders building a dataset, this suggests including more raw GitHub dumps or mixed-quality examples rather than hand-curating only perfect ones.
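To make the trade-off concrete, here is a minimal Python sketch. It is not the paper's method: the `Doc` type, the `quality` scores, the 0.5 threshold, and the whitespace token count are all hypothetical stand-ins. It only illustrates the arithmetic behind the argument, namely that with a fixed training-token budget, filtering shrinks the pool of unique data, so the model has to repeat the surviving documents more times.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    quality: float  # hypothetical classifier score in [0, 1]


def select(docs, threshold):
    """Keep documents scoring at least `threshold`; 0.0 keeps everything."""
    return [d for d in docs if d.quality >= threshold]


def epochs_needed(docs, token_budget):
    """Passes over `docs` implied by a fixed budget of training tokens.

    Tokens are approximated by whitespace splitting (an assumption made
    only to keep the example self-contained).
    """
    unique_tokens = sum(len(d.text.split()) for d in docs)
    return token_budget / unique_tokens


# Toy pool with made-up quality scores.
pool = [
    Doc("def add(a, b): return a + b", 0.9),
    Doc("x = 1 # todo fix later maybe", 0.4),
    Doc("print ( 'hello' ) ; ; pass", 0.2),
    Doc("for i in range(3): print(i)", 0.8),
]

budget = 108  # training tokens the available compute can process
filtered = select(pool, threshold=0.5)
unfiltered = select(pool, threshold=0.0)

print("filtered docs:", len(filtered), "epochs:", epochs_needed(filtered, budget))
print("unfiltered docs:", len(unfiltered), "epochs:", epochs_needed(unfiltered, budget))
```

In this toy setup the filtered set is repeated more than twice as often as the unfiltered one for the same budget. Whether the extra repetition or the extra noise hurts more is the empirical question the paper addresses; the sketch does not answer it.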