This is a very interesting decision: Microsoft decided not to use any LLM-generated data or any open-source training datasets for pretraining
Microsoft excludes synthetic data and open-source datasets from its MAI-Base-1 training pipeline to enable cleaner downstream evaluation
Filtering cut the web corpus by 400 billion pages.
Pre-Training Data
The pipeline looks pretty standard. One interesting thing is that they don't use any open-source training datasets, including stuff from Hugging Face, presumably to make downstream evaluation and decontamination easier.
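To make the decontamination point concrete, here is a minimal sketch of one common approach: flagging training documents that share a word n-gram with benchmark text. This is not the method from the report; the function names, the toy documents, and the 5-gram window are all assumptions made for illustration.

```python
import re


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(doc: str, benchmark_ngrams: set[tuple[str, ...]], n: int = 8) -> bool:
    """Flag a training document that shares any n-gram with the benchmark set."""
    return not ngrams(doc, n).isdisjoint(benchmark_ngrams)


# Hypothetical benchmark item and training documents, for illustration only.
benchmark = ["What is the capital of France? The capital of France is Paris."]
bench_set = set().union(*(ngrams(q, n=5) for q in benchmark))

clean_doc = "Paris has many museums and a famous tower built in 1889."
leaked_doc = "Quiz answer key: the capital of France is Paris, obviously."

print(is_contaminated(clean_doc, bench_set, n=5))
print(is_contaminated(leaked_doc, bench_set, n=5))
```

The fewer third-party datasets go into the mix, the fewer places benchmark text can sneak in unnoticed, which is presumably why owning the whole pipeline makes this kind of check easier to trust.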
There is a TON of info in Appendix A. Generally they have 4 sources:
1) Most of their HTML data comes from a proprietary crawler instead of Common Crawl (CC). (Completely unrelated, but it's hilarious that applying NSFW/privacy filters cuts the number of web pages by a third, 1.2T -> 800B.)
Interestingly, they have an in-house AI detection model that they use to filter out AI-generated text. This is the first time I've seen someone say this explicitly, but I would be shocked if other labs don't have this.
They then have models that score the Qwen3 embeddings of these pages. They use a bunch of quality and heuristic filters, which reduces the corpus to around 7.4B documents. They also human-curate a high-quality subset of long-form writing.
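As a rough picture of what "models that score embeddings" could look like, here is a minimal sketch of a logistic probe over precomputed page embeddings followed by a threshold filter. The report's actual scorer is not described here, so the weights, bias, threshold, and toy 3-dimensional vectors are all hypothetical.

```python
import math


def quality_score(embedding: list[float], weights: list[float], bias: float) -> float:
    """Logistic score in (0, 1) from a linear probe over a page embedding."""
    logit = sum(e * w for e, w in zip(embedding, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def filter_pages(pages: dict[str, list[float]], weights: list[float], bias: float,
                 threshold: float = 0.5) -> list[str]:
    """Keep the ids of pages whose score meets the threshold."""
    return [pid for pid, emb in pages.items()
            if quality_score(emb, weights, bias) >= threshold]


# Toy 3-dimensional "embeddings"; real embeddings have far more dimensions.
pages = {
    "essay": [0.9, 0.1, 0.2],
    "spam": [-0.8, 0.7, -0.3],
}
weights = [2.0, -1.0, 0.5]  # hypothetical probe weights, not learned here
bias = 0.0

kept = filter_pages(pages, weights, bias)
print(kept)
```

The appeal of this kind of setup is that the expensive embedding pass runs once per page, and any number of cheap scorers can then be trained and applied on top of it.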
For STEM pages, they classify by topic and score by educational value. They also build a custom LaTeX-to-Markdown processor.
They have a subsection on why they chose NLL vs other forms of evals here:
- Skips the need to do long generations with CoT or a judge model
- Robustness to formatting quirks and MCQ variability
- Difficulty in coming up with novel questions to test
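For anyone unfamiliar with NLL-based evals, here is a minimal sketch of the core computation: the average negative log-likelihood per token that a model assigns to a fixed reference text. The per-token log-probabilities below are made up for illustration; in practice they would come from a single forward pass, with no sampling and no judge model.

```python
import math


def mean_nll(token_logprobs: list[float]) -> float:
    """Average negative log-likelihood per token of a reference continuation."""
    return -sum(token_logprobs) / len(token_logprobs)


# Hypothetical per-token log-probabilities that two checkpoints assign to the
# same reference answer.
checkpoint_a = [math.log(0.5), math.log(0.25), math.log(0.5)]
checkpoint_b = [math.log(0.8), math.log(0.5), math.log(0.8)]

nll_a = mean_nll(checkpoint_a)
nll_b = mean_nll(checkpoint_b)

# Lower NLL means the model finds the reference text more likely.
print(nll_a > nll_b)
```

Because the reference text is fixed, the score does not depend on how the model formats its own output, which lines up with the robustness argument above.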