/AI · 9h ago

Microsoft excludes synthetic data and open-source datasets from its MAI-Base-1 training pipeline to enable cleaner downstream evaluation

Filtering cut the web corpus by 400 billion pages.

Original post

Hanna Hajishirzi#276

Harveen Singh Chadha (@HarveenChadha)

This is a very interesting decision: Microsoft decided not to use any LLM-generated data or any open-source training dataset for pretraining.

12:09 PM · Jun 2, 2026 · 9.4K Views

Sentiment

Users are praising Microsoft's decision to train MAI-Base-1 without LLM-generated or open source data because they see the approach as sensible and potentially higher quality.

Pos: 100.0% · Neg: 0.0% (2 comments with sentiment)

Posts from X

Most Activity

Views: 51 · Replies: 1

wh (@nrehiew_)

Pre-Training Data: The pipeline looks pretty standard. One interesting thing is that they don't use any open-source training datasets, including stuff hosted on Hugging Face, presumably to facilitate downstream evaluation and decontamination.

There is a TON of info in Appendix A. Generally they have 4 sources:

1) Most of their HTML data comes from a proprietary crawler instead of Common Crawl (CC). (Completely unrelated, but it's hilarious that applying NSFW/privacy filters reduces the number of web pages by 1/3, from 1.2T to 800B.)

Interestingly, they have an in-house AI detection model and they use it to filter out AI-generated text. This is the first time I've seen someone say this explicitly, but I would be shocked if other labs don't have this.

They then have models that score the Qwen3 embeddings of these pages. They use a bunch of quality and heuristic filters, which reduce the corpus to around 7.4B documents. They also human-curate a high-quality subset of long-form writing.

For STEM pages, they classify by topic and score by educational value. They also build a custom LaTeX-to-Markdown processor.
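To make the "models that score the embeddings" step concrete, here is a minimal sketch of one way such a filter could work: a logistic scorer over precomputed page embeddings, followed by a threshold. Nothing here is taken from the report. The function names, the linear scorer, the weights, the 0.5 threshold, and the toy 2-dimensional vectors are all assumptions for illustration only.

```python
import numpy as np

def quality_scores(embeddings, weights, bias=0.0):
    """Logistic quality score in (0, 1) for each row of `embeddings`.

    `weights` and `bias` stand in for a trained scorer (hypothetical).
    """
    logits = embeddings @ weights + bias
    return 1.0 / (1.0 + np.exp(-logits))

def filter_by_quality(docs, embeddings, weights, threshold=0.5):
    """Keep the documents whose score meets the (assumed) threshold."""
    scores = quality_scores(embeddings, weights)
    keep = scores >= threshold
    kept_docs = [doc for doc, k in zip(docs, keep) if k]
    return kept_docs, scores

# Toy data: three pages with made-up 2-d embeddings.
docs = ["long-form essay", "spam page", "tutorial"]
embeddings = np.array([[1.0, 0.0], [-1.0, 0.5], [0.5, 0.2]])
weights = np.array([2.0, -1.0])  # hypothetical learned weights

kept, scores = filter_by_quality(docs, embeddings, weights)
print(kept)
```

In a real pipeline the scorer would presumably be trained on labeled examples and run alongside the heuristic filters mentioned above; this only shows the score-then-threshold shape.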

wh (@nrehiew_)

They have a subsection on why they chose NLL vs other forms of evals:
- Skips the need to do long generations with CoT or a judge model
- Robustness to formatting quirks and MCQ variability
- Difficulty in coming up with novel questions to test
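For readers unfamiliar with NLL-style evals, a minimal sketch of the general idea: instead of generating an answer, score each candidate continuation by the negative log-likelihood the model assigns to it, and treat the lowest NLL as the model's choice. The per-token log-probabilities below are made up, and whether the report length-normalizes (as `mean_nll` does here) is an assumption, not something stated in the thread.

```python
import math

def nll(token_logprobs):
    """Total negative log-likelihood of a continuation."""
    return -sum(token_logprobs)

def mean_nll(token_logprobs):
    """Per-token NLL, so longer answers are not penalized just for length."""
    return nll(token_logprobs) / len(token_logprobs)

def pick_answer(candidates):
    """Return the candidate with the lowest per-token NLL.

    `candidates` maps answer text to the list of log-probs a model
    assigned to that answer's tokens (toy values here).
    """
    return min(candidates, key=lambda answer: mean_nll(candidates[answer]))

# Hypothetical log-probs for three answers to the same prompt.
candidates = {
    "Paris": [math.log(0.9)],
    "Lyon": [math.log(0.05)],
    "Marseille": [math.log(0.5), math.log(0.1)],
}

best = pick_answer(candidates)
print(best)
```

This needs only a forward pass per candidate, with no long generation and no judge model, which matches the first reason listed above.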
