
Prime Intellect's Elie Bakouch details MAI-Base-1's MoE scaling ladder, which derives all architectural configurations from model depth

The flagship configuration scales to 962 billion parameters.

Original post
elie@eliebakouch

my favorite part: the scaling ladder 😍

the only knob they change is model depth (number of layers), everything else is derived from it with heuristics

elie@eliebakouch

they use loss-based load balancing (this is also what Qwen uses) and say that the optimal load balancing varies with the expert capacity.

expert capacity is the "max amount of tokens that an expert can process". this only makes sense with token dropping methods (if an expert has too many tokens routed to it, you just drop the overflow). but they end up using a dropless implementation (which is standard afaik?)
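A minimal sketch of the two ideas above, assuming top-1 routing and a Switch-Transformer-style auxiliary loss. The thread does not give the exact loss or capacity formula MAI-Base-1 uses, so `expert_capacity`, `route_with_capacity`, `aux_load_balance_loss`, and `capacity_factor` are illustrative names, not their implementation.

```python
import math

def expert_capacity(num_tokens, num_experts, top_k=1, capacity_factor=1.0):
    """Max number of tokens one expert may process in a batch (assumed formula)."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)

def route_with_capacity(assignments, num_experts, capacity):
    """Top-1 routing with token dropping: overflow tokens are dropped.

    assignments[t] is the expert index chosen for token t.
    Returns (kept_token_indices, dropped_token_indices).
    """
    load = [0] * num_experts
    kept, dropped = [], []
    for tok, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(tok)
        else:
            dropped.append(tok)
    return kept, dropped

def aux_load_balance_loss(assignments, router_probs, num_experts):
    """Switch-style auxiliary loss: N * sum_i f_i * P_i.

    f_i = fraction of tokens routed to expert i,
    P_i = mean router probability for expert i.
    Equals 1.0 when both are perfectly uniform.
    """
    n = len(assignments)
    f = [sum(1 for e in assignments if e == i) / n for i in range(num_experts)]
    p = [sum(row[i] for row in router_probs) / n for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# 4 tokens, 2 experts, three tokens pick expert 0
assignments = [0, 0, 0, 1]
cap = expert_capacity(num_tokens=4, num_experts=2)
kept, dropped = route_with_capacity(assignments, 2, cap)

# "dropless" is equivalent to a capacity large enough that nothing overflows
kept_all, dropped_none = route_with_capacity(assignments, 2, capacity=len(assignments))
```

With a finite capacity the third token sent to expert 0 is dropped; in the dropless case the capacity never binds, which is why the notion only really matters for token-dropping implementations.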

5:27 PM · Jun 2, 2026 · 985 Views
elie@eliebakouch

the "loss" definition is VERY important, the scaling ladder heavily relies on it. it's an NLL (negative log likelihood) on a private set with:

- 50% code
- 17.5% STEM
- 17.5% Math
- 10% General knowledge
- 5% Multilingual

they then use this target NLL and normalize it with an in-house model. normalization matters because raw NLL scales differ across benchmarks
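A rough sketch of how such a target could be computed, assuming "normalize with an in-house model" means dividing each domain's NLL by a reference model's NLL on the same domain. That ratio form is an assumption; the thread only says normalization is applied. `WEIGHTS` and `normalized_target_nll` are hypothetical names.

```python
# domain weights taken from the thread
WEIGHTS = {
    "code": 0.50,
    "stem": 0.175,
    "math": 0.175,
    "general": 0.10,
    "multilingual": 0.05,
}

def normalized_target_nll(model_nll, reference_nll, weights=WEIGHTS):
    """Weighted sum of per-domain NLLs, each divided by a reference model's NLL.

    ASSUMPTION: normalization is a per-domain ratio against the reference
    model, so every domain contributes on a comparable scale.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("domain weights must sum to 1")
    return sum(w * model_nll[d] / reference_nll[d] for d, w in weights.items())

# made-up numbers, for illustration only
reference = {"code": 0.8, "stem": 1.6, "math": 1.2, "general": 2.0, "multilingual": 2.4}
candidate = {d: 0.9 * v for d, v in reference.items()}  # 10% lower NLL everywhere
score = normalized_target_nll(candidate, reference)
```

Under this reading a score of 1.0 means "on par with the reference model", and a domain with naturally high raw NLL (e.g. multilingual) cannot dominate the sum just because of its scale.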

elie@eliebakouch

the rule to promote a new architecture is based on this scaling ladder. they have this Efficiency Gain (EG) metric which basically quantifies "to reach the loss our candidate got, how much more compute would the baseline have needed?"

"compute" here can mean flops or time, but as we'll see later, the pipeline is often optimized for flops first, then optimized for time.

the marin folks have a quite similar setup!
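One way to read the EG definition, as a sketch: fit a scaling law to the baseline's ladder, invert it at the candidate's loss, and divide by the compute the candidate actually used. The pure power law `loss = a * C**(-b)` (no irreducible-loss term) and the names `fit_power_law` / `efficiency_gain` are assumptions here, not the fit the authors use.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute). Returns (a, b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - slope * mx)
    return a, -slope

def efficiency_gain(a, b, candidate_compute, candidate_loss):
    """Compute the baseline would need to match candidate_loss, over candidate_compute."""
    baseline_compute_needed = (a / candidate_loss) ** (1.0 / b)
    return baseline_compute_needed / candidate_compute

# synthetic baseline ladder (made-up numbers): loss = 10 * C^-0.05
ladder_compute = [1e18, 1e19, 1e20]
ladder_loss = [10.0 * c ** -0.05 for c in ladder_compute]
a, b = fit_power_law(ladder_compute, ladder_loss)

# candidate trained at 1e19 reaches the loss the baseline only gets at 2e19
candidate_loss = 10.0 * (2e19) ** -0.05
eg = efficiency_gain(a, b, candidate_compute=1e19, candidate_loss=candidate_loss)
```

Here `eg` comes out at about 2, i.e. the baseline would need roughly twice the compute. The same function works whether "compute" is measured in flops or in wall-clock time, as long as the ladder and the candidate use the same unit.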

elie@eliebakouch

one VERY BIG question once you have sourced the data is how to mix it, and this heavily depends on the metric you choose to optimize. the issue is that in most cases optimizing the mixture for one domain means you automatically lose on another one. it's illustrated nicely here with this html/code val loss plot

elie@eliebakouch

and here are all the "tools" that they use

elie@eliebakouch

the only knob they change is model depth (number of layers), everything is derived from it with heuristics.

first heuristic:

hidden size = L * 256/3

this is derived from recent models, here is how it compares to others.

other parameters:
- fixed expert sparsity (unless ablated)
- FFN expansion is 2x, latentMoE hyperparameters are 2x compression -> 3x expansion (see the plot on latentMoE to understand what this means)
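A small sketch of "everything derived from depth", using only the numbers quoted above. Rounding the hidden size to a multiple of 64 is an assumption (the thread does not say how non-integer values are handled), and so is the reading of the latentMoE numbers as "latent = hidden / 2, expert inner width = 3 * latent". `derive_config` and its field names are hypothetical.

```python
def derive_config(num_layers, multiple_of=64):
    """Derive widths from depth using the heuristics quoted in the thread.

    ASSUMPTIONS: rounding to `multiple_of`, and the interpretation of
    "2x compression -> 3x expansion" for latentMoE.
    """
    raw_hidden = num_layers * 256 / 3            # hidden size = L * 256/3
    hidden = max(multiple_of, round(raw_hidden / multiple_of) * multiple_of)
    latent = hidden // 2                         # 2x compression (assumed reading)
    return {
        "num_layers": num_layers,
        "hidden_size": hidden,
        "ffn_size": 2 * hidden,                  # FFN expansion is 2x
        "moe_latent_size": latent,
        "moe_expert_inner_size": 3 * latent,     # 3x expansion (assumed reading)
    }

# walk a few rungs of a hypothetical ladder
ladder = [derive_config(depth) for depth in (12, 24, 48)]
```

For example, 48 layers gives a hidden size of 48 * 256 / 3 = 4096, so each rung of the ladder is fully specified by a single integer.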

elie@eliebakouch

and they even trace this back to the fact that part of the STEM content was less deduped than the rest, which scales badly as flops increase.

this is actually the big point about dedup being so important here, it basically increases the number of effective epochs without you really knowing
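The "hidden epochs" effect is simple arithmetic. A back-of-the-envelope sketch, assuming duplicates behave like exact repeats (near-duplicates will be fuzzier in practice); `effective_epochs` is an illustrative helper, not something from the report.

```python
def effective_epochs(nominal_epochs, total_tokens, unique_tokens):
    """Approximate number of times each unique token is seen.

    ASSUMPTION: duplicated content counts as an exact repeat, so the
    duplication factor total/unique multiplies the nominal epoch count.
    """
    duplication_factor = total_tokens / unique_tokens
    return nominal_epochs * duplication_factor

# made-up example: a source with 300B tokens of which only 100B are unique
seen = effective_epochs(nominal_epochs=2, total_tokens=300e9, unique_tokens=100e9)
```

Two nominal passes over that source are closer to six passes over its unique content, and the gap widens as the flop budget (and therefore the number of passes) grows.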

elie@eliebakouch

data ablation setup

for data quality, they either upweight a single source by 50% and train from scratch to see marginal utility, or ablate within the full mixture on the scaling ladder with epoch-matched downsampling and forecast to target scale via EG. for mid-training they do single-source microanneals (LR decay from an intermediate checkpoint) to locally tune weights.
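Two of the pieces above are easy to sketch. Assumptions: "upweight by 50%" is read as multiplying one source's weight by 1.5 and renormalizing, and the microanneal uses a linear decay to zero (the thread only says "LR decay"). `upweight_source` and `microanneal_lr` are hypothetical helpers.

```python
def upweight_source(mixture, source, factor=1.5):
    """Multiply one source's weight by `factor`, then renormalize to sum to 1."""
    boosted = dict(mixture)
    boosted[source] *= factor
    total = sum(boosted.values())
    return {name: weight / total for name, weight in boosted.items()}

def microanneal_lr(step, total_steps, start_lr, end_lr=0.0):
    """Linear LR decay starting from an intermediate checkpoint's LR (assumed shape)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start_lr + (end_lr - start_lr) * frac

# made-up two-source mixture
base = {"web": 0.5, "code": 0.5}
ablation = upweight_source(base, "code")          # code: 0.6, web: 0.4
schedule = [microanneal_lr(s, 100, 1e-3) for s in (0, 50, 100)]
```

Comparing a run on `ablation` against a run on `base` gives the marginal utility of the upweighted source; the microanneal does the analogous comparison cheaply by branching from an existing checkpoint.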

for data mixing, they do
(1) thousands of small models on sampled mixtures to forecast optimal mix
(2) hierarchical search, local within category + global between categories with an 8 epoch cap
(3) verify by training finalists at ~2.8x compute and checking the optimum is scale-stable

the scaling ladder discipline is applied throughout, not just tacked on at the end
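The 8 epoch cap in step (2) can be expressed as a simple feasibility filter on candidate mixtures. A sketch, assuming the cap applies per source and that epochs are computed as (mixture weight * token budget) / source size; `epochs_per_source` and `within_epoch_cap` are illustrative names.

```python
def epochs_per_source(mixture, source_tokens, token_budget):
    """Number of passes over each source implied by a mixture and a token budget."""
    return {s: mixture[s] * token_budget / source_tokens[s] for s in mixture}

def within_epoch_cap(mixture, source_tokens, token_budget, max_epochs=8.0):
    """True if no source is repeated more than `max_epochs` times (assumed per-source cap)."""
    epochs = epochs_per_source(mixture, source_tokens, token_budget)
    return all(e <= max_epochs for e in epochs.values())

# made-up sizes: a large code source and a small math source (in tokens)
sizes = {"code": 100e9, "math": 10e9}
mix = {"code": 0.5, "math": 0.5}

ok_small = within_epoch_cap(mix, sizes, token_budget=100e9)   # math seen 5x
ok_large = within_epoch_cap(mix, sizes, token_budget=200e9)   # math seen 10x
```

The same mixture is feasible at the smaller budget and infeasible at the larger one, which is one reason an optimum found at small scale has to be re-checked for scale stability, as in step (3).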
