Modularity-Aware Pretraining Delivers 3.6pp Lift for Dense Student Models

Original post

8/ Finally, a complementary direction: we tested compatibility with modularity-aware pretraining (EMO, https://arxiv.org/abs/2605.06663).

Modularity-aware pretraining gives a +3.6pp lift and ~87× lower pre-distillation PPL on the dense student model vs. a regular-MoE teacher.

6:41 AM · Jun 9, 2026 · 946 Views
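
For scale, here is a minimal sketch of what an ~87× perplexity ratio means per token. It assumes PPL is the usual exponential of the mean per-token negative log-likelihood in nats, which the post does not confirm. The NLL values and helper names below are hypothetical and chosen only to illustrate the arithmetic.

```python
import math

def perplexity(token_nlls):
    """Perplexity as exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def nll_gap_for_ppl_ratio(ratio):
    """Mean per-token NLL gap (nats) implied by a given perplexity ratio."""
    return math.log(ratio)

# Hypothetical per-token NLLs, not taken from the paper.
baseline_nlls = [7.0, 7.2, 6.8]

# Shift every token's NLL down by ln(87) to mimic an 87x lower perplexity.
gap = nll_gap_for_ppl_ratio(87.0)
improved_nlls = [nll - gap for nll in baseline_nlls]

ratio = perplexity(baseline_nlls) / perplexity(improved_nlls)
print(f"NLL gap per token: {gap:.4f} nats")
print(f"Perplexity ratio: {ratio:.2f}x")
```

Under that assumption, an 87× perplexity ratio is a gap of roughly 4.47 nats per token in mean NLL. The +3.6pp figure is a separate quantity: an absolute difference in percentage points, not a relative change.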

Sentiment

Users praise modularity-aware pretraining for boosting MoE distillation because it offers a promising co-design direction worth exploring.

Pos 100.0% · Neg 0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 81 · Likes 3 · Replies 1

We see such compression-aware pretraining as a co-design direction worth exploring.

Thanks to @sewon__min for introducing the work at ICLR, and to the authors @RyanYixiang @AkshitaB93 for the great work!

Please check out the paper for more details 🙂

Code: http://github.com/krafton-ai/moe-to-dense
Paper: http://arxiv.org/abs/2605.28207

This is joint work with amazing collaborators: @JihunYun_ai, Haechan, @gmkim_ai, Joonghyun, @jaewoong_cho from @Krafton_AI

Suresh (@_Suresh2)

@jhyuckkim does the 87x pre-distillation PPL gap actually survive fine-tuning? i've had big ppl drops vanish after one epoch
