
Modularity-Aware Pretraining Delivers 3.6pp Lift for Dense Student Models

Original postSewon Min#201
Junhyuck Kim @jhyuckkim

8/ Finally, a complementary direction: we tested compatibility with modularity-aware pretraining (EMO, https://arxiv.org/abs/2605.06663).

Modularity-aware pretraining gives a +3.6pp lift and ~87× lower pre-distillation PPL on the dense student model vs. a regular-MoE teacher.

6:41 AM · Jun 9, 2026 · 946 Views
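For readers unfamiliar with the two units in the post above, here is a minimal sketch of how a percentage-point lift and an "N× lower" perplexity are typically computed. The thread does not say which metric the +3.6pp refers to, so this assumes a plain accuracy score; the function names and all input numbers are hypothetical and are not taken from the paper.

```python
import math

def perplexity(token_nlls):
    """Perplexity as exp of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def pp_lift(baseline_acc, new_acc):
    """Difference between two accuracies (fractions in [0, 1]) in percentage points."""
    return (new_acc - baseline_acc) * 100.0

def ppl_ratio(baseline_ppl, new_ppl):
    """How many times lower new_ppl is than baseline_ppl."""
    return baseline_ppl / new_ppl

# Hypothetical per-token NLLs, chosen only to illustrate the arithmetic.
regular_nlls = [6.0, 7.0, 8.0]   # mean NLL 7.0
modular_nlls = [2.0, 2.5, 3.0]   # mean NLL 2.5

ratio = ppl_ratio(perplexity(regular_nlls), perplexity(modular_nlls))
lift = pp_lift(0.500, 0.536)     # hypothetical accuracies

# An N-times perplexity gap equals a ln(N) gap in mean NLL,
# so an 87x gap corresponds to roughly 4.47 nats per token.
nll_gap_for_87x = math.log(87)

print(f"PPL ratio: {ratio:.1f}x, lift: {lift:.1f}pp, NLL gap: {nll_gap_for_87x:.2f} nats")
```

Because perplexity is exponential in the loss, a large ratio reflects a fixed additive gap in mean NLL, which is useful context when asking whether such a gap persists after further training.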
Sentiment

Users respond positively to modularity-aware pretraining as a way to improve MoE-to-dense distillation, describing it as a co-design direction worth exploring.

Pos: 100.0%
Neg: 0.0%
1 comment with sentiment.
Junhyuck Kim @jhyuckkim

We see such compression-aware pretraining as a co-design direction worth exploring.

Thanks to @sewon__min for introducing the work at ICLR, and to the authors @RyanYixiang @AkshitaB93 for the great work!

1d · Views 81 · Likes 3
Junhyuck Kim @jhyuckkim

Please check out the paper for more details 🙂

Code: http://github.com/krafton-ai/moe-to-dense
Paper: http://arxiv.org/abs/2605.28207

This is joint work with amazing collaborators: @JihunYun_ai, Haechan, @gmkim_ai, Joonghyun, @jaewoong_cho from @Krafton_AI

1d · Views 78 · Likes 2
Suresh @_Suresh2

@jhyuckkim does the 87x pre-distillation PPL gap actually survive fine-tuning? i've had big ppl drops vanish after one epoch

1d · Views 18 · Likes 1