8/512 (8 active experts out of 512) is pretty sparse, especially compared to the rest of the field, where most models are around 8/256. But of course, we also need to account for the dense layers, which are always active.
(Some models aren't included in this chart, like M2.5, which is 8/256.)
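To put rough numbers on that comparison, here is a small sketch of the arithmetic. The layer counts and the `dense_width_in_experts` parameter are made-up placeholders for illustration, not figures from any model, and shared experts and attention parameters are ignored.

```python
def routed_fraction(top_k: int, n_experts: int) -> float:
    """Fraction of routed experts that fire per token in one MoE layer."""
    return top_k / n_experts


def ffn_active_fraction(top_k: int, n_experts: int, n_moe_layers: int,
                        n_dense_layers: int, dense_width_in_experts: float) -> float:
    """Fraction of FFN parameters active per token, counting dense layers.

    Everything is measured in units of "one expert's worth of parameters".
    dense_width_in_experts is a hypothetical knob: how many experts one
    dense FFN layer is equivalent to in size.
    """
    active = n_moe_layers * top_k + n_dense_layers * dense_width_in_experts
    total = n_moe_layers * n_experts + n_dense_layers * dense_width_in_experts
    return active / total


print(routed_fraction(8, 512))  # 0.015625
print(routed_fraction(8, 256))  # 0.03125

# Hypothetical stack: 45 MoE layers plus 3 dense layers, each dense layer
# as wide as 8 experts. The dense layers push the active fraction up a bit.
print(ffn_active_fraction(8, 512, n_moe_layers=45, n_dense_layers=3,
                          dense_width_in_experts=8))
```

So on routed experts alone, 8/512 activates half as large a fraction as 8/256, and any always-on dense layers move the effective number upward.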
Load balancing is done at a global level, which helps reduce variance.
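A toy illustration of why the level at which you compute balance statistics matters. This assumes a Switch-style auxiliary loss (N times the sum over experts of routed share times mean router probability), which may not be the exact loss used here. The function names and the micro-batch dictionaries are hypothetical.

```python
def aux_loss(counts, prob_sums, n_tokens, top_k):
    """Switch-style balance loss: N * sum_i f_i * P_i (equals 1.0 when uniform)."""
    n_experts = len(counts)
    f = [c / (n_tokens * top_k) for c in counts]  # share of routed slots per expert
    p = [s / n_tokens for s in prob_sums]         # mean router probability per expert
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))


def local_loss(micro_batches, top_k):
    """Compute the loss inside each micro-batch, then average the losses."""
    losses = [aux_loss(mb["counts"], mb["prob_sums"], mb["n_tokens"], top_k)
              for mb in micro_batches]
    return sum(losses) / len(losses)


def global_loss(micro_batches, top_k):
    """Pool the statistics across all micro-batches first, then compute one loss."""
    n_experts = len(micro_batches[0]["counts"])
    counts = [sum(mb["counts"][i] for mb in micro_batches) for i in range(n_experts)]
    prob_sums = [sum(mb["prob_sums"][i] for mb in micro_batches) for i in range(n_experts)]
    n_tokens = sum(mb["n_tokens"] for mb in micro_batches)
    return aux_loss(counts, prob_sums, n_tokens, top_k)


# Two experts, top-1 routing, 4 tokens per micro-batch.
# Each micro-batch sends everything to a different expert.
batches = [
    {"counts": [4, 0], "prob_sums": [3.6, 0.4], "n_tokens": 4},
    {"counts": [0, 4], "prob_sums": [0.4, 3.6], "n_tokens": 4},
]

print(local_loss(batches, top_k=1))   # about 1.8: each micro-batch looks badly skewed
print(global_loss(batches, top_k=1))  # about 1.0: pooled together, load is even
```

Small per-micro-batch samples give noisy, sometimes misleading balance statistics. Pooling them before computing the loss smooths that out.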
They also do dropless routing. To avoid sudden spikes for any one expert, they run multiple groups of capped MoE loops (this is discussed later on).
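One possible reading of "capped loops", sketched here purely as an assumption: tokens are processed in several passes, each pass caps how many tokens any single expert sees, and overflow rolls into the next pass so nothing is dropped. The function name `capped_rounds` and the `cap` parameter are hypothetical, and the actual scheme may differ.

```python
from collections import defaultdict


def capped_rounds(assignments, cap):
    """Split token indices into passes where no expert gets more than `cap` tokens.

    assignments[t] is the expert chosen for token t (top-1 for simplicity).
    Every token lands in exactly one pass, so routing stays dropless.
    """
    seen = defaultdict(int)  # tokens already scheduled per expert
    rounds = []
    for token, expert in enumerate(assignments):
        r = seen[expert] // cap  # which pass this token falls into
        while len(rounds) <= r:
            rounds.append([])
        rounds[r].append(token)
        seen[expert] += 1
    return rounds


# Expert 0 is "hot": it receives 4 of the 6 tokens.
assignments = [0, 0, 0, 1, 0, 1]
print(capped_rounds(assignments, cap=2))  # [[0, 1, 3, 5], [2, 4]]
```

In this sketch the hot expert's peak load per pass is bounded by `cap`, at the cost of extra passes when routing is skewed.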