
MAI-Base-1 Deploys 8/512 Sparse MoE With Global Load Balancing

Original post
wh@nrehiew_

8/512 is pretty sparse, especially compared to the rest of the field (most are around 8/256). But of course, we need to account for the dense layers.

(Some models aren't included in this chart, like M2.5, which is 8/256)
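For a sense of scale, here is a quick back-of-the-envelope sketch of what those ratios mean. It only looks at routed experts inside a single MoE layer, so it ignores the dense layers mentioned above and any shared experts; the helper name `active_fraction` is made up for illustration.

```python
from fractions import Fraction


def active_fraction(top_k: int, num_experts: int) -> Fraction:
    """Share of routed experts that a single token activates in one MoE layer."""
    return Fraction(top_k, num_experts)


# Only the two configurations named in the thread.
configs = {"8/512": (8, 512), "8/256": (8, 256)}

for name, (top_k, num_experts) in configs.items():
    frac = active_fraction(top_k, num_experts)
    print(f"{name}: {frac} of routed experts active per token ({float(frac):.4%})")
```

So 8/512 activates half the share of routed experts that 8/256 does, before accounting for any dense layers.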

wh@nrehiew_

Load balancing is done at a global level which helps reduce variance.

They also do dropless routing. To avoid sudden spikes for any one expert, they run multiple groups of capped MoE loops (this is discussed later on)
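A toy illustration of why measuring load globally reduces variance, under loud assumptions: routing here is uniformly random (no real router), all micro-batches have the same size, and the sizes themselves are invented. It is not the model's actual balancing loss, just the statistics behind the claim.

```python
import numpy as np


def expert_load(topk_idx: np.ndarray, num_experts: int) -> np.ndarray:
    """Fraction of routed token slots that landed on each expert."""
    counts = np.bincount(topk_idx.ravel(), minlength=num_experts)
    return counts / counts.sum()


def random_routing(rng, num_tokens: int, num_experts: int, top_k: int) -> np.ndarray:
    """Assumption: stand-in router that picks top_k distinct experts at random per token."""
    scores = rng.random((num_tokens, num_experts))
    return np.argsort(scores, axis=1)[:, -top_k:]


rng = np.random.default_rng(0)
num_experts, top_k = 512, 8
tokens_per_microbatch, num_microbatches = 1024, 16  # hypothetical sizes

microbatches = [
    random_routing(rng, tokens_per_microbatch, num_experts, top_k)
    for _ in range(num_microbatches)
]

# Load statistics seen by each micro-batch on its own vs. pooled across all of them.
local_loads = np.stack([expert_load(mb, num_experts) for mb in microbatches])
global_load = expert_load(np.concatenate(microbatches), num_experts)

local_spread = local_loads.std(axis=1).mean()
global_spread = global_load.std()
print(f"mean per-micro-batch spread of expert load: {local_spread:.2e}")
print(f"spread of globally pooled expert load:      {global_spread:.2e}")
```

The pooled estimate averages out per-micro-batch noise, so a balancing signal computed from it is less jittery than one computed from each micro-batch separately.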

8:27 PM · Jun 2, 2026 · 328 Views
Posts from X
wh@nrehiew_

Their architecture is slightly non-standard, so their next section is on ablations. They fit a bunch of models and essentially compute the FLOPs required to reach some baseline loss.

EG in the figures is essentially how many more FLOPs the baseline model required. Higher values mean the baseline required more FLOPs, i.e. the candidate is more efficient. They compute EG both in terms of FLOPs and time (accounting for MFU).

For example, the table shows MoE in every layer against their interleaved layout. They get better MFU with the interleaved layout, which leads to faster wallclock time even though the FLOPs required are ~similar.

The other plot shows increased gains as sparsity increases.
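A minimal sketch of how the two flavors of EG relate, assuming EG is simply the baseline cost divided by the candidate cost to reach the same loss, and that wallclock time is FLOPs divided by (MFU × peak throughput). All numbers below are hypothetical and not taken from the report.

```python
def efficiency_gain(baseline_cost: float, candidate_cost: float) -> float:
    """EG > 1 means the baseline needed more of the resource than the candidate."""
    return baseline_cost / candidate_cost


def wallclock_seconds(flops: float, mfu: float, peak_flops_per_s: float) -> float:
    """Time to execute `flops` at a given model FLOPs utilization (MFU)."""
    return flops / (mfu * peak_flops_per_s)


# Hypothetical numbers: both layouts need the same FLOPs to hit the target loss,
# but the candidate runs at a higher MFU.
peak = 1e15  # assumed peak FLOP/s of the hardware
baseline_flops, candidate_flops = 1e21, 1e21
baseline_mfu, candidate_mfu = 0.30, 0.40

eg_flops = efficiency_gain(baseline_flops, candidate_flops)
eg_time = efficiency_gain(
    wallclock_seconds(baseline_flops, baseline_mfu, peak),
    wallclock_seconds(candidate_flops, candidate_mfu, peak),
)
print(f"EG (FLOPs): {eg_flops:.2f}")
print(f"EG (time):  {eg_time:.2f}")
```

With equal FLOPs, the time-based EG reduces to the ratio of MFUs, which is how a layout can win on wallclock while looking roughly even on FLOPs.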

wh@nrehiew_

Extremely heavy focus on determinism (think Dsv4)

wh@nrehiew_

Systems (YOLO)
- Custom kernels for FP8 GEMM and quantization, grouped GEMM
- Training is done with a mix of ZeRO-2 and ZeRO-3 (last stage of midtraining). Interestingly, their implementation of ZeRO always shards parameters, and the different stages just differ in how often the param buffer is cleared?
- Ulysses CP

They pipeline expert computation locally within subgroups of experts, which means only the first and last A2A cannot be overlapped. For their dropless implementation, they have multiple subbatches per group.
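One possible reading of "capped but dropless", sketched under heavy assumptions: each pass lets an expert take at most `capacity` tokens, and overflow is deferred to a later pass instead of being dropped. The function name, the pass structure, and the capacity value are all invented for illustration; the actual implementation is not described in this thread.

```python
from collections import defaultdict


def capped_passes(assignments, capacity):
    """Split (token, expert) pairs into passes with at most `capacity` tokens
    per expert per pass. Overflow moves to the next pass; nothing is dropped."""
    per_expert = defaultdict(list)
    for token, expert in assignments:
        per_expert[expert].append(token)

    passes = []
    start = 0
    while True:
        current = [
            (token, expert)
            for expert, tokens in per_expert.items()
            for token in tokens[start:start + capacity]
        ]
        if not current:
            break
        passes.append(current)
        start += capacity
    return passes


# Expert 0 is "hot" with five tokens, expert 1 only has one.
assignments = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 1)]
for i, batch in enumerate(capped_passes(assignments, capacity=2)):
    print(f"pass {i}: {batch}")
```

The hot expert's work is spread over several bounded passes, so no single step sees a spike, while every token still gets processed.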

wh@nrehiew_

Appreciate the transparency about loss spikes, which they attribute to high expert imbalance on coding datasets
