Sewon Min#188
Junhyuck Kim@jhyuckkim
8/ Finally, a complementary direction: we tested compatibility with modularity-aware pretraining (EMO, https://arxiv.org/abs/2605.06663).
Compared with a regular MoE teacher, modularity-aware pretraining gives the dense student model a +3.6pp lift and ~87× lower pre-distillation PPL.
6:41 AM · Jun 9, 2026