MAI-Base-1 Deploys 8/512 Sparse MoE With Global Load Balancing

wh@nrehiew_

8/512 is pretty sparse, especially compared to the rest of the field (most models are around 8/256). But of course, we need to account for the dense layers.

(Some models aren't included in this chart, like M2.5, which is 8/256.)
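For a sense of scale, the two routing ratios above can be compared directly. This is a minimal sketch: `active_fraction` is a hypothetical helper, and it only counts routed experts inside a single MoE layer, so it ignores the dense layers and any shared experts mentioned as a caveat.

```python
from fractions import Fraction

def active_fraction(top_k: int, num_experts: int) -> Fraction:
    """Fraction of routed experts a single token activates in one MoE layer."""
    return Fraction(top_k, num_experts)

for top_k, num_experts in [(8, 512), (8, 256)]:
    frac = active_fraction(top_k, num_experts)
    print(f"{top_k}/{num_experts}: {float(frac):.4%} of routed experts active per token")
```

Under this simplified view, 8/512 activates half as large a share of the routed experts as 8/256.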

wh@nrehiew_

Load balancing is done at a global level, which helps reduce variance.

They also do dropless routing. To avoid sudden spikes for any one expert, they run multiple groups of capped MoE loops (this is discussed later on).
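One way to see why pooling load statistics globally reduces variance: the per-expert load fraction measured on a single micro-batch is noisy, while the same statistic pooled over many micro-batches is much smoother. The sketch below is an illustration only, assuming a stand-in router that picks experts from random scores. The function name, sizes, and the spread metric are all hypothetical and are not taken from the report.

```python
import numpy as np

def expert_load_fractions(expert_ids: np.ndarray, num_experts: int) -> np.ndarray:
    """Fraction of routed token slots assigned to each expert."""
    counts = np.bincount(expert_ids.ravel(), minlength=num_experts)
    return counts / counts.sum()

rng = np.random.default_rng(0)
num_experts, top_k = 512, 8
num_microbatches, tokens_per_microbatch = 16, 512  # hypothetical sizes

# Stand-in router (assumption): random scores, keep the top_k experts per token.
scores = rng.random((num_microbatches, tokens_per_microbatch, num_experts))
assignments = np.argpartition(scores, -top_k, axis=-1)[..., -top_k:]

# Local view: load fractions computed separately for each micro-batch.
local = np.stack([expert_load_fractions(a, num_experts) for a in assignments])
# Global view: load fractions pooled over all micro-batches.
pooled = expert_load_fractions(assignments, num_experts)

local_spread = local.std(axis=1).mean()   # average spread across experts, per micro-batch
global_spread = pooled.std()              # spread across experts after pooling
print(f"local spread:  {local_spread:.6f}")
print(f"global spread: {global_spread:.6f}")
```

With a perfectly uniform router the true load is flat, so any spread here is sampling noise. The pooled estimate sees more tokens per expert and therefore shows less of it.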

8:27 PM · Jun 2, 2026 · 328 Views

wh@nrehiew_

Their architecture is slightly non-standard, so the next section covers ablations. They fit a set of models and essentially compute the FLOPs each one needs to reach some baseline loss.

EG in the figures is essentially how many more FLOPs the baseline model required. A higher value means the baseline needed more FLOPs, i.e. the candidate is more efficient. They compute EG both in terms of FLOPs and in terms of time (accounting for MFU).

For example, the table compares MoE in every layer against their interleaved layout. The interleaved layout gets better MFU, which leads to faster wall-clock time even though the FLOPs required are roughly similar.

The other plot shows the gains increasing as sparsity increases.
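As a rough illustration of the FLOPs-versus-time distinction, the sketch below treats EG as the ratio of baseline cost to candidate cost at equal loss, and converts FLOPs to time by dividing by achieved throughput (MFU times peak hardware FLOP/s). That reading of EG is an assumption based on the description above. The helper names and all numbers are hypothetical and are not from the report.

```python
def eg_ratio(baseline_cost: float, candidate_cost: float) -> float:
    """Cost the baseline needs divided by cost the candidate needs for the same loss."""
    return baseline_cost / candidate_cost

def wallclock_seconds(flops: float, mfu: float, peak_flops_per_s: float) -> float:
    """Time to execute `flops` when the hardware runs at `mfu` of its peak rate."""
    return flops / (mfu * peak_flops_per_s)

# Illustrative numbers only (assumptions, not measurements).
peak = 1.0e15                      # peak FLOP/s of the hardware
baseline_flops = 1.0e21            # FLOPs for the baseline to reach the target loss
candidate_flops = 1.0e21           # candidate needs about the same FLOPs...
baseline_mfu, candidate_mfu = 0.30, 0.40  # ...but runs at higher utilization

eg_flops = eg_ratio(baseline_flops, candidate_flops)
eg_time = eg_ratio(
    wallclock_seconds(baseline_flops, baseline_mfu, peak),
    wallclock_seconds(candidate_flops, candidate_mfu, peak),
)
print(f"EG (FLOPs): {eg_flops:.3f}")
print(f"EG (time):  {eg_time:.3f}")
```

With equal FLOPs the FLOPs-based ratio is 1, while the time-based ratio reduces to the ratio of the two MFU values, which is how a layout can win on wall-clock time without saving compute.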

wh@nrehiew_

Extremely heavy focus on determinism (think Dsv4)

wh@nrehiew_

Systems (YOLO)
- Custom kernels for FP8 GEMM and quantization, grouped GEMM
- Training is done with a mix of ZeRO-2 and ZeRO-3 (last stage of midtraining). Interestingly, their implementation of ZeRO always shards parameters, and the different stages just differ in how often the param buffer is cleared?
- Ulysses CP

They pipeline expert computation locally within sub-groups of experts, which means only the first and last A2A cannot be overlapped. For their dropless implementation, they use multiple sub-batches per group.
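One possible reading of capped, dropless sub-batches is sketched below: each expert handles at most `cap` tokens per pass, and passes repeat until every token has been processed, so a hot expert costs extra passes instead of dropped tokens. This is a toy illustration with top-1 routing, not the report's implementation. `capped_dropless_passes` and `cap` are hypothetical names.

```python
import numpy as np

def capped_dropless_passes(expert_ids: np.ndarray, num_experts: int, cap: int):
    """Split token indices into passes where each expert sees at most `cap` tokens.

    Returns a list of passes; each pass maps expert id -> array of token indices.
    """
    per_expert = [np.flatnonzero(expert_ids == e) for e in range(num_experts)]
    # Ceil-divide the busiest expert's token count by the cap.
    num_passes = max((-(-len(idx) // cap) for idx in per_expert), default=0)
    passes = []
    for p in range(num_passes):
        chunk = {
            e: idx[p * cap:(p + 1) * cap]
            for e, idx in enumerate(per_expert)
            if len(idx) > p * cap
        }
        passes.append(chunk)
    return passes

# Expert 0 is "hot": it receives 5 of the 8 tokens.
expert_ids = np.array([0, 0, 0, 0, 0, 1, 2, 2])
passes = capped_dropless_passes(expert_ids, num_experts=3, cap=2)
for i, chunk in enumerate(passes):
    print(i, {e: idx.tolist() for e, idx in chunk.items()})
```

The per-pass cap bounds the peak work for any single expert, while the loop guarantees that every token is eventually routed.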

wh@nrehiew_

Appreciate the transparency about the loss spikes, which they attribute to high expert imbalance on coding datasets.
