my favorite part: the scaling ladder 😍
the only knob they change is model depth (number of layers); everything else is derived from it with heuristics
they use loss-based load balancing, i.e. an auxiliary loss term that pushes the router to spread tokens evenly across experts (this is also what Qwen uses), and say that the optimal load balancing varies with the expert capacity.
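for reference, a rough sketch of what a loss-based load balancing term can look like. this is the Switch-Transformer-style formulation with top-1 routing, which is an assumption on my side: the notes above don't say which exact variant they (or Qwen) use. `load_balancing_loss` and `alpha` are made-up names for illustration.

```python
def load_balancing_loss(router_probs, num_experts, alpha=0.01):
    """Switch-style auxiliary loss: alpha * N * sum_i f_i * P_i.

    router_probs: one list of per-expert probabilities per token.
    f_i = fraction of tokens whose top-1 expert is i
    P_i = mean router probability assigned to expert i
    (assumed formulation, not taken from the paper discussed here)
    """
    n_tokens = len(router_probs)
    counts = [0] * num_experts
    prob_sums = [0.0] * num_experts
    for probs in router_probs:
        # top-1 routing: the token goes to its highest-probability expert
        top = max(range(num_experts), key=lambda e: probs[e])
        counts[top] += 1
        for e in range(num_experts):
            prob_sums[e] += probs[e]
    f = [c / n_tokens for c in counts]
    p = [s / n_tokens for s in prob_sums]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


# perfectly balanced router: each of 4 tokens goes to a different expert
balanced = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
# collapsed router: every token goes to expert 0
collapsed = [[1, 0, 0, 0]] * 4

print(load_balancing_loss(balanced, 4))   # 0.01 (= alpha)
print(load_balancing_loss(collapsed, 4))  # 0.04 (= alpha * num_experts)
```

with this form the loss is lowest when tokens are spread evenly and grows as routing collapses onto a few experts, which is the pressure the router feels during training.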
expert capacity is the "max number of tokens that an expert can process". this only makes sense with token-dropping methods (if an expert gets more tokens routed to it than its capacity, you just drop the extra ones). but they end up using a dropless implementation (which is standard afaik?)
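to make the capacity idea concrete, here is a tiny sketch of token dropping with top-1 routing. the `ceil(tokens / experts * capacity_factor)` rule is the usual convention as far as I know, not something stated in the notes above, and the function names are hypothetical.

```python
import math


def expert_capacity(num_tokens, num_experts, capacity_factor=1.25):
    # assumed convention: even share per expert, scaled by a slack factor
    return math.ceil(num_tokens / num_experts * capacity_factor)


def route_with_dropping(assignments, num_experts, capacity):
    """assignments[i] is the expert picked for token i (top-1 routing).

    Returns (kept, dropped): kept maps expert -> token indices it processes,
    dropped lists the token indices that overflowed their expert's capacity.
    """
    kept = {e: [] for e in range(num_experts)}
    dropped = []
    for tok, expert in enumerate(assignments):
        if len(kept[expert]) < capacity:
            kept[expert].append(tok)
        else:
            dropped.append(tok)  # expert is full, token is skipped
    return kept, dropped


# 8 tokens, 4 experts, router heavily favors expert 0
assignments = [0, 0, 0, 0, 0, 1, 2, 3]
cap = expert_capacity(len(assignments), num_experts=4, capacity_factor=1.0)
kept, dropped = route_with_dropping(assignments, 4, cap)
print(cap)      # 2
print(kept[0])  # [0, 1]
print(dropped)  # [2, 3, 4]
```

in a dropless setup there is no such cap: every expert processes whatever it gets routed, so `dropped` would always be empty and the capacity knob stops meaning anything.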