Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.
Stanford researchers release the paper 'A Bitter Lesson for Data Filtering', showing that large models improve without data filtering in high-compute, data-scarce regimes.
The work targets pretraining scenarios where compute is abundant and data is limited.
Supportive users find the Stanford study on large models benefiting from unfiltered data fascinating because it suggests that added compute can outperform filtering efforts, whereas critics dismiss it as a scaling trap that conceals bad data.

Explain this in plain English for coders that are using LLMs to code
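If you train or fine-tune an LLM with plenty of compute but only a limited pool of data, aggressive filtering may cost more than it gains. Filtering shrinks the pool, so a large compute budget ends up repeating the same few "clean" examples many times. Large models can tolerate a surprising amount of noisy, low-quality, or even shuffled snippets, and can sometimes improve from them. A plausible reason is that the extra volume acts as built-in regularization that curbs overfitting. The result is reported for pretraining on DCLM, so applying it to code data is an extrapolation. For coders building their own datasets, it suggests trying raw GitHub dumps or mixed-quality examples before investing in hand-curating perfect ones.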
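
To make the "data-scarce, compute-rich" trade-off concrete, here is a minimal sketch of the arithmetic only. It does not reproduce the paper's experiments. The token counts, the 10% keep rate, and the names Pool, make_pool, and epochs_needed are all hypothetical, chosen for illustration. It shows how a fixed training budget translates into repeated passes over whatever data survives the filter.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    unique_tokens: float  # tokens remaining after the filter is applied


def make_pool(raw_tokens: float, keep_fraction: float, name: str) -> Pool:
    """Model a filter as keeping a fixed fraction of the raw tokens."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    return Pool(name=name, unique_tokens=raw_tokens * keep_fraction)


def epochs_needed(train_tokens: float, pool: Pool) -> float:
    """Number of passes over the pool implied by a fixed training budget."""
    return train_tokens / pool.unique_tokens


# Hypothetical numbers, not taken from the paper.
RAW_TOKENS = 100e9    # unique tokens available before any filtering
TRAIN_TOKENS = 400e9  # tokens the compute budget lets us train on

unfiltered = make_pool(RAW_TOKENS, 1.0, "no filter")
filtered = make_pool(RAW_TOKENS, 0.1, "keep top 10%")

for pool in (unfiltered, filtered):
    print(f"{pool.name}: {epochs_needed(TRAIN_TOKENS, pool):.1f} passes over the data")
```

With these made-up numbers, the filtered pool has to be repeated ten times as often as the unfiltered one to use the same budget. Whether heavy repetition of clean data or fewer passes over noisier data works better is an empirical question, so compare both on a held-out set for your own task.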

