
Microsoft excludes synthetic data and open-source datasets from its MAI-Base-1 training pipeline to enable cleaner downstream evaluation

Filtering cut the web corpus by 400 billion pages.

Harveen Singh Chadha@HarveenChadha

This is a very interesting decision: Microsoft decided not to use any LLM-generated data or any open-source training dataset for pretraining.

12:09 PM · Jun 2, 2026 · 9.4K Views
Posts from X
wh@nrehiew_

Pre-Training Data: The pipeline looks pretty standard. One interesting thing is that they don't use any open-source training datasets, including those hosted on Hugging Face, presumably to facilitate downstream evaluation and decontamination.
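To make the decontamination point concrete, here is a minimal sketch of one common approach, n-gram overlap checking. The thread does not say how Microsoft does it; the function names, the whitespace tokenization, and the 8-gram window are all assumptions chosen for illustration.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text (whitespace tokenization, an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(document, benchmark_items, n=8):
    """Flag a document that shares at least one n-gram with any benchmark item.

    Hypothetical helper: n=8 is an illustrative window, not a value from the source.
    """
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)


# Toy data, invented for this example.
benchmark = ["the quick brown fox jumps over the lazy dog near the river bank"]
leaky = "someone posted that the quick brown fox jumps over the lazy dog near the river bank yesterday"
clean = "a page about crawling and filtering web data at scale for language model pretraining runs"

print(is_contaminated(leaky, benchmark))
print(is_contaminated(clean, benchmark))
```

The fewer third-party datasets go into the corpus, the fewer places benchmark text can hide, which is presumably why excluding them simplifies this kind of check.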

There is a TON of info in Appendix A. Generally they have 4 sources:

1) Most of their HTML data comes from a proprietary crawler instead of CC (Common Crawl). (Completely unrelated, but it's hilarious that applying NSFW/privacy filters cuts the number of web pages by a third, from 1.2T to 800B.)

Interestingly, they have an in-house AI detection model, and they use it to filter out AI-generated text. This is the first time I've seen someone say this explicitly, but I would be shocked if other labs don't have this.

They then have models that score the Qwen3 embeddings of these pages. They use a bunch of quality and heuristic filters, which reduce the set to around 7.4B documents. They also have humans curate a high-quality subset of long-form writing.
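As a rough illustration of scoring pages from their embeddings, the sketch below applies a logistic scorer to precomputed embedding vectors and keeps pages above a threshold. The thread does not describe the actual scoring models; the linear form, the weights, the threshold, and every name here are hypothetical.

```python
import math


def quality_score(embedding, weights, bias=0.0):
    """Logistic score in (0, 1) from a dot product of weights and embedding (assumed model form)."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def filter_pages(pages, weights, threshold=0.5):
    """Keep the ids of pages whose score meets the threshold (threshold is illustrative)."""
    return [page_id for page_id, emb in pages if quality_score(emb, weights) >= threshold]


# Toy 2-dimensional "embeddings"; real embedding vectors are far larger.
weights = [1.0, -1.0]
pages = [("a", [2.0, 0.0]), ("b", [0.0, 2.0])]

print(filter_pages(pages, weights))
```

In practice the weights would be learned from labeled examples of high- and low-quality pages; here they are set by hand only to show the shape of the computation.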

For STEM pages, they classify by topic and score by educational value. They also built a custom LaTeX-to-Markdown processor.

wh@nrehiew_

They have a subsection on why they chose NLL vs. other forms of evals:
- Skips the need to do long generations with CoT or a judge model
- Robustness to formatting quirks and MCQ variability
- Difficulty in coming up with novel questions to test
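For readers unfamiliar with the metric, NLL (negative log-likelihood) evaluation scores the probability a model assigns to a reference text, so no generation or judge model is involved. The sketch below computes mean per-token NLL from a list of token log-probabilities; the function name and the toy numbers are assumptions, and how the log-probabilities are obtained from a model is left out.

```python
import math


def mean_nll(token_logprobs):
    """Mean negative log-likelihood per token, given natural-log probabilities of the reference tokens."""
    return -sum(token_logprobs) / len(token_logprobs)


# Toy example: the model assigns probability 0.5 and 0.25 to two reference tokens.
logprobs = [math.log(0.5), math.log(0.25)]

nll = mean_nll(logprobs)
perplexity = math.exp(nll)  # perplexity is exp of the mean NLL
print(nll, perplexity)
```

Because the score depends only on the reference text's likelihood, it sidesteps answer-format parsing and multiple-choice ordering effects, which matches the robustness point in the list above.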
