Prime Intellect's Elie Bakouch details MAI-Base-1's MoE scaling ladder, which derives all architectural configurations from model depth · Digg

Posts from X

Views 28K · Bookmarks 157 · Likes 268 · Retweets 8 · Replies 17

swyx @aiDotEngineer WF@swyx

probably the best reward function for reasoning efficiency i've seen

elie@eliebakouch

length penalty is very elegant and simple tbh

27d28K268157

elie@eliebakouch

length penalty is very elegant and simple tbh

elie@eliebakouch

main differences in GRPO are:
> length penalty
> entropy-based outer clip
> no KL term (makes sense here since the model is not cold started)
> normalization is global instead of per response
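
To make the last point concrete, here is a minimal sketch of one plausible reading of "global instead of per response": averaging token losses over every token in the batch rather than averaging each response first. The function names and the toy numbers are illustrative assumptions, not taken from the report, and the report may define the normalization differently.

```python
# Sketch only: compares two ways of normalizing token-level losses.
# Both function names are hypothetical; the report's exact formula is not quoted in the thread.

def per_response_mean(token_losses):
    """Average each response over its own tokens, then average over responses."""
    per_response = [sum(resp) / len(resp) for resp in token_losses]
    return sum(per_response) / len(per_response)


def global_mean(token_losses):
    """Sum every token loss in the batch and divide by the total token count."""
    total = sum(sum(resp) for resp in token_losses)
    count = sum(len(resp) for resp in token_losses)
    return total / count


# A short response with low loss and a long response with high loss.
batch = [[1.0, 1.0], [4.0, 4.0, 4.0, 4.0, 4.0, 4.0]]

print(per_response_mean(batch))  # each response counts equally
print(global_mean(batch))        # each token counts equally
```

With per-response averaging a long response's tokens are each down-weighted by its length; with global averaging every token carries the same weight, so long responses contribute more to the update.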

27d26.8K9138

elie@eliebakouch

the pre-training data part is amazing, and has a lot of olmo vibes to it (hi @soldni @kylelostat @HannaHajishirzi and co <3)

they put a lot of care into extraction and dedup (we will see a very good example of why dedup matters)

the data comes from both common crawl (very nice of them to say this) and private sources

no synthetic data (intentionally), and they have targeted sub-pipelines for different domains

elie@eliebakouch

they found a ton of public leakage in pre-training data, and hence why they don't trust public benchmarks for measuring improvement. NLL-based evaluations are also much faster and don't rely on capacity like multi-choice formats that are easily benchmaxable.

actually imo the only issue here is that this doesn't really transfer to post-training capacity, there might be a way to adapt it tho by doing NLL on reasoning traces?

27d5.6K4811

elie@eliebakouch

the "loss" definition is VERY important, the scaling ladder heavily relies on this. it's a NLL private set (negative log likelihood) with:

50% code, 17.5% STEM, 17.5% math, 10% general knowledge, 5% multilingual

they then use this target NLL and normalize it with an in-house model. normalization matters because raw NLL scales differ across benchmarks
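
A rough sketch of how such a weighted, normalized loss could be computed. The domain weights are the ones listed in the thread; the normalization as a plain ratio against the reference model's NLL, the function name, and the dictionary keys are assumptions made for illustration only.

```python
# Sketch only: weighted aggregate of per-domain NLL, normalized by a reference model.
# The ratio-style normalization is an assumption; the thread only says an in-house model is used.

DOMAIN_WEIGHTS = {
    "code": 0.50,
    "stem": 0.175,
    "math": 0.175,
    "general_knowledge": 0.10,
    "multilingual": 0.05,
}


def normalized_loss(candidate_nll, reference_nll, weights=DOMAIN_WEIGHTS):
    """Weighted sum of candidate NLL divided by reference NLL, per domain."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("domain weights must sum to 1")
    return sum(
        weight * candidate_nll[domain] / reference_nll[domain]
        for domain, weight in weights.items()
    )


# Hypothetical per-domain NLL values, chosen only to show the call.
reference = {"code": 0.8, "stem": 1.6, "math": 1.2, "general_knowledge": 2.0, "multilingual": 2.4}
candidate = {domain: 0.9 * value for domain, value in reference.items()}
print(normalized_loss(candidate, reference))
```

Dividing by the reference keeps a domain with naturally high raw NLL (for example multilingual text) from dominating the aggregate just because of its scale.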

elie@eliebakouch

the rule to promote a new architecture is based on this scaling ladder. they have this Efficiency Gain (EG) metric which basically quantifies "to reach the loss our candidate got, how much more compute would the baseline have needed?"

"compute" here can mean flops or time, but as we'll see later, the pipeline is often optimized for flops first, then optimized for time.

the marin folks have a quite similar setup!
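
The EG question above can be sketched numerically. This is a minimal version that assumes the baseline ladder follows a pure power law, loss = a * compute^(-b), with no irreducible-loss term; the report may fit a different functional form, and all names and numbers here are hypothetical.

```python
import math

# Sketch only: Efficiency Gain as "baseline compute needed to match the candidate's loss,
# divided by the compute the candidate actually used". Pure power-law fit is an assumption.


def fit_power_law(computes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(compute). Returns (a, b)."""
    xs = [math.log(c) for c in computes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def efficiency_gain(baseline_computes, baseline_losses, candidate_compute, candidate_loss):
    """Invert the baseline fit at the candidate's loss and compare compute budgets."""
    a, b = fit_power_law(baseline_computes, baseline_losses)
    baseline_compute_needed = (a / candidate_loss) ** (1.0 / b)
    return baseline_compute_needed / candidate_compute


# Hypothetical baseline ladder lying exactly on loss = 10 * C^-0.1.
ladder_compute = [1e18, 1e19, 1e20]
ladder_loss = [10.0 * c ** -0.1 for c in ladder_compute]

# A candidate that, at 1e19 FLOPs, reaches the loss the baseline would need 2e19 FLOPs for.
candidate_loss = 10.0 * (2e19) ** -0.1
print(efficiency_gain(ladder_compute, ladder_loss, 1e19, candidate_loss))
```

An EG above 1 means the candidate is worth that multiple of baseline compute; swapping FLOPs for wall-clock time in `computes` gives the time-based variant mentioned in the post.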

27d5.8K4811

elie@eliebakouch

microsoft uses SGlang wow

elie@eliebakouch

a lot more on data curation/creation to bootstrap good reasoning without prior model

27d3.4K517

elie@eliebakouch

the full pipeline is EXTREMELY detailed in appendix A, with very precise numbers, this is amazing

elie@eliebakouch

the pre-training data part is amazing, and has a lot of olmo vibes to it (hi @soldni @kylelostat @HannaHajishirzi and co <3)

they put a lot of care into extraction and dedup (we will see a very good example of why dedup matters)

the data comes from both common crawl (very nice of them to say this) and private sources

no synthetic data (intentionally), and they have targeted sub-pipelines for different domains

27d3.4K397

elie@eliebakouch

and if you're done with this thread and still want to read more about this report, please take a look at the goat @stochasticchasm recap

27d2.2K187

elie@eliebakouch

one very cool thing i forgot here is that the different domains don't react the same to things like architecture changes, here you can see that increasing sparsity helps code a lot but much less other domains which is a super interesting finding imo

elie@eliebakouch

one VERY BIG question when you have sourced the data is how to mix them. and this heavily depends on the metric you choose to optimize. the issue here is that optimizing the mixture for one domain means that you automatically lose on another one in most cases, it's illustrated nicely here with this nice html/code val loss plot

27d3.2K583

elie@eliebakouch

they found a ton of public leakage in pre-training data, and hence why they don't trust public benchmarks for measuring improvement. NLL-based evaluations are also much faster and don't rely on capacity like multi-choice formats that are easily benchmaxable.

actually imo the only issue here is that this doesn't really transfer to post-training capacity, there might be a way to adapt it tho by doing NLL on reasoning traces?

elie@eliebakouch

the "loss" definition is VERY important, the scaling ladder heavily relies on this. it's a NLL private set (negative log likelihood) with:

50% code, 17.5% STEM, 17.5% math, 10% general knowledge, 5% multilingual

they then use this target NLL and normalize it with an in-house model. normalization matters because raw NLL scales differ across benchmarks

27d4.8K404

elie@eliebakouch

there are some solutions here with automatic mixing (which is basically an optimization problem). idea is to have small scale proxies to predict larger scale optimal mixture.

and they found that this actually doesn't transfer at scale, here is the example they cite with a "stem heavy mix" vs "code heavy mix"

btw this is a 20T token ablation on a ~615B total param model which is almost the same compute as the final model training :)))))

elie@eliebakouch

one very cool thing i forgot here is that the different domains don't react the same to things like architecture changes, here you can see that increasing sparsity helps code a lot but much less other domains which is a super interesting finding imo

27d3.3K463

elie@eliebakouch

one VERY BIG question when you have sourced the data is how to mix them. and this heavily depends on the metric you choose to optimize. the issue here is that optimizing the mixture for one domain means that you automatically lose on another one in most cases, it's illustrated nicely here with this nice html/code val loss plot

elie@eliebakouch

and here are all the "tools" that they use

27d3.3K493

elie@eliebakouch

this was an insanely good read, i think this is the most detailed report i've read at this scale in some aspects. i really hope MAI continues releasing those tech reports, thanks a lot to the team for this gift 🥹 https://microsoft.ai/wp-content/uploads/2026/06/main_20260602_2.pdf#page=81.11

elie@eliebakouch

will conclude with this, 40% higher throughput per Watt (or is it different from "rack power budget"?) is pretty impressive and bullish on microsoft chips

27d1.3K344

elie@eliebakouch

the rule to promote a new architecture is based on this scaling ladder. they have this Efficiency Gain (EG) metric which basically quantifies "to reach the loss our candidate got, how much more compute would the baseline have needed?"

"compute" here can mean flops or time, but as we'll see later, the pipeline is often optimized for flops first, then optimized for time.

the marin folks have a quite similar setup!

elie@eliebakouch

for tokens per parameter (TPP), they mention it varies by ablation. ablations run at 100/200 TPP which is around "chinchilla optimal". chinchilla for dense is ~20 TPP, so a ~5-10x factor from their MoE setup? interesting
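
The arithmetic behind the "~5-10x" remark, as a tiny sketch. The 20 TPP dense figure is the rule of thumb quoted in the post; whether the report counts active or total parameters for its MoE TPP is not stated in the thread, so the parameter count below is just a placeholder.

```python
# Sketch only: tokens-per-parameter (TPP) arithmetic from the post.

CHINCHILLA_DENSE_TPP = 20  # rough dense rule of thumb quoted in the thread


def training_tokens(num_params, tokens_per_param):
    """Token budget implied by a parameter count and a TPP target."""
    return num_params * tokens_per_param


def ratio_to_dense(tokens_per_param, dense_tpp=CHINCHILLA_DENSE_TPP):
    """How many times larger a TPP target is than the dense rule of thumb."""
    return tokens_per_param / dense_tpp


for tpp in (100, 200):
    # 10**9 is a placeholder parameter count, not a number from the report.
    print(tpp, ratio_to_dense(tpp), training_tokens(10**9, tpp))
```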

27d3.9K354

Frank Xu@frankxu2004

What a ride to work with you on coding RL. Insanely fun

Jiayi Wei@MrVPlusOne

I feel incredibly honored to have contributed to this work alongside the most talented and hardworking team I’ve ever worked with. @MicrosoftAI

Building and climbing an LLM from scratch was full of unknowns, but there are also many magical moments when things finally worked.

Excited to share what we learned and give back to the community! https://microsoft.ai/wp-content/uploads/2026/06/main_20260602_2.pdf

26d3.2K244

elie@eliebakouch

and here are all the "tools" that they use

elie@eliebakouch

the full pipeline is EXTREMELY detailed in appendix A, with very precise numbers, this is amazing

27d3.5K383

elie@eliebakouch

the only knob they change is model depth (number of layers), everything is derived from it with heuristics.

first heuristic:

hidden size = L * 256/3

this is derived from recent models, here is how it compares to others.

other parameters:
- fixed expert sparsity (unless ablated)
- FFN expansion is 2x, latentMoE hyperparameters are 2x compression -> 3x expansion (see the plot on latentMoE to understand what this means)
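
A minimal sketch of deriving a config from depth alone with the heuristics above. Only hidden size = L * 256/3 and the 2x / 2x / 3x factors come from the thread. The rounding to a multiple of 64, the reading of "2x compression -> 3x expansion" as latent = hidden / 2 followed by an expert width of 3 * latent, and all field names are assumptions for illustration.

```python
from dataclasses import dataclass

# Sketch only: everything is derived from num_layers.
# Rounding rule and the latentMoE interpretation are assumptions, not from the report.


@dataclass
class LadderConfig:
    num_layers: int
    hidden_size: int
    ffn_size: int            # 2x FFN expansion of hidden_size
    latent_size: int         # hidden_size compressed 2x (assumed meaning)
    latent_expert_size: int  # latent_size expanded 3x (assumed meaning)


def round_to_multiple(value, multiple):
    """Round to the nearest multiple, with halves rounding up."""
    return int(value / multiple + 0.5) * multiple


def config_from_depth(num_layers, multiple_of=64):
    hidden = round_to_multiple(num_layers * 256 / 3, multiple_of)
    latent = hidden // 2
    return LadderConfig(
        num_layers=num_layers,
        hidden_size=hidden,
        ffn_size=2 * hidden,
        latent_size=latent,
        latent_expert_size=3 * latent,
    )


for depth in (12, 24, 48):
    print(config_from_depth(depth))
```

Depths that are multiples of 3 give an exact hidden size (48 layers maps to 4096); other depths need some rounding rule, which is where the assumed `multiple_of` comes in.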

elie@eliebakouch

my favorite part: the scaling ladder 😍

the only knob they change is model depth (number of layers), everything else is derived from it with heuristics

27d4.3K422

elie@eliebakouch

the training loss is beautiful, no spike at all, the dream of everyone pre-training models aha

elie@eliebakouch

about long context, they basically use the same mixture as 32k with proper packing, which makes sense because they don't have long agentic rollouts yet, and also a previous ai2 paper found that, surprisingly, long context data didn't matter that much during this phase

27d2.5K372

elie@eliebakouch

data ablation setup

for data quality, they either upweight a single source by 50% and train from scratch to see marginal utility, or ablate within the full mixture on the scaling ladder with epoch-matched downsampling and forecast to target scale via EG. for mid-training they do single-source microanneals (LR decay from an intermediate checkpoint) to locally tune weights.

for data mixing, they do:
(1) thousands of small models on sampled mixtures to forecast optimal mix
(2) hierarchical search, local within category + global between categories with an 8 epoch cap
(3) verify by training finalists at ~2.8x compute and checking the optimum is scale-stable

the scaling ladder discipline is applied throughout, not just tacked on at the end
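
The epoch cap in step (2) can be checked with simple arithmetic. This sketch assumes the cap applies per source and that epochs are computed as mixture weight times training tokens divided by source size; the source names and sizes are made up for illustration.

```python
# Sketch only: how many times each source is repeated under a candidate mixture.
# Applying the 8-epoch cap per source is an assumption about how the report uses it.

EPOCH_CAP = 8


def epochs_per_source(mixture, source_tokens, train_tokens):
    """Epochs over each source = share of the training budget / unique tokens available."""
    return {
        name: weight * train_tokens / source_tokens[name]
        for name, weight in mixture.items()
    }


def sources_over_cap(mixture, source_tokens, train_tokens, cap=EPOCH_CAP):
    """Names of sources that a mixture would repeat more than `cap` times."""
    epochs = epochs_per_source(mixture, source_tokens, train_tokens)
    return [name for name, count in epochs.items() if count > cap]


# Hypothetical sources: a small code set and a large web set, 20T training tokens.
mixture = {"code": 0.5, "web": 0.5}
source_tokens = {"code": 10**12, "web": 10 * 10**12}
train_tokens = 20 * 10**12

print(epochs_per_source(mixture, source_tokens, train_tokens))
print(sources_over_cap(mixture, source_tokens, train_tokens))
```

A mixture search that ignores this constraint can "win" on a small proxy run simply by over-sampling a small source that would be repeated far too often at full scale.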

elie@eliebakouch

and they even trace back to the fact that part of the STEM content was less deduped than others, which scales badly as flops increase.

this is actually the big point about dedup being so important here, it basically increases the number of effective epochs without you really knowing
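
A toy illustration of the "hidden epochs" effect: if a document appears k times in the corpus, one nominal pass over the data shows it to the model k times. The document ids and helper name below are hypothetical.

```python
from collections import Counter

# Sketch only: exact-duplicate counting on toy document ids.
# Real pipelines also handle near-duplicates, which this does not model.


def effective_epochs(doc_ids, nominal_epochs):
    """Times each distinct document is actually seen after `nominal_epochs` passes."""
    copies = Counter(doc_ids)
    return {doc: count * nominal_epochs for doc, count in copies.items()}


# "b" was left in the corpus three times, "a" once.
corpus = ["a", "b", "b", "b"]
print(effective_epochs(corpus, nominal_epochs=2))
```

The more training tokens you spend, the more these silent repeats add up, which is consistent with the less-deduped STEM data scaling badly as flops increase.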

27d2.9K303

elie@eliebakouch

here is the full precision scheme

elie@eliebakouch

some cool results on RMSNorm init impacting the contribution of attention at init (when random) and hence leading to small instabilities in the load balancing

27d2.3K214

elie@eliebakouch

here are open base models evaluated on their NLL bench

i also understand better this post by @NandoDF aha

Nando de Freitas@NandoDF

Your comments are absolutely correct. However, it depends on what you define as useful. Doing well on tasks that cover the 2500 topics could be useful.

Above all, reporting this for the base and post trained models would be super illuminating. I certainly would likely learn a lot from it 🤓

27d3.3K263