A large MoE model may be wasting half its expert compute on tokens that barely need expert help.
In this paper 50% of expert computation removed, with almost no loss in accuracy.
This makes already-trained MoE models like Qwen3 and GLM stop calling half their experts when a token is too easy to need them.
Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones.
Shows that many MoE tokens do not need real experts, only permission to skip them.
That sounds like a small routing trick, but it changes the economics of deployed language models.
Standard MoE models already avoid using every parameter, yet they still spend the same expert budget on every token.
ZEDA adds a strange new option to the router: experts that output exactly nothing.
When the model routes a token to one of these zero experts, it is not making the model dumber; it is admitting that this token does not need another expensive transformation.
The clever part is not the dummy expert, but the adaptation method.
Instead of retraining the model from scratch, the original MoE becomes a frozen teacher, while the new dynamic version learns when it can safely skip work.
Across Qwen3-30B-A3B and GLM-4.7-Flash, the result is roughly half the expert computation removed, with only marginal average accuracy loss and about 20% real inference speedup.
The deeper finding is: compute use did not simply track task difficulty.
The model spent more expert budget where uncertainty or teacher-student disagreement rose, while structured code and math fragments often needed less.
That makes ZEDA feel less like pruning and more like attention to computational doubt.
----
Paper Link – arxiv. org/abs/2605.18643
Paper Title: "Post-Trained MoE Can Skip Half Experts via Self-Distillation"
