Tech · 12h ago

Junhyuck Kim finds modularity-aware pretraining improves accuracy by 3.6 percentage points when distilling MoE models into dense student models

The approach also lowered WikiText-2 pre-distillation perplexity roughly 87-fold compared with distilling from a regular MoE teacher.

Original post

Sewon Min#931

Junhyuck Kim (@jhyuckkim)

8/ Finally, a complementary direction: we tested compatibility with modularity-aware pretraining (EMO, https://arxiv.org/abs/2605.06663).

Modularity-aware pretraining gives a +3.6pp lift and ~87× lower pre-distillation PPL on the dense student model vs. a regular-MoE teacher.

6:41 AM · Jun 9, 2026 · 567 Views
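For context on the ~87× figure: perplexity is the exponential of the mean per-token negative log-likelihood, so a multiplicative perplexity ratio corresponds to an additive gap in per-token loss. The sketch below is illustrative only. The per-token losses are hypothetical values chosen to show the relationship, not numbers from the paper, and the helper names are made up for this example.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def nll_gap_for_ppl_ratio(ratio):
    """Mean per-token NLL gap (in nats) implied by a perplexity ratio."""
    return math.log(ratio)

# Hypothetical per-token losses, NOT taken from the paper.
baseline_nlls = [7.0, 7.5, 6.5]

# An 87x perplexity ratio implies a gap of ln(87) nats per token.
gap = nll_gap_for_ppl_ratio(87.0)
improved_nlls = [x - gap for x in baseline_nlls]

ratio = perplexity(baseline_nlls) / perplexity(improved_nlls)
print(f"NLL gap: {gap:.2f} nats/token, PPL ratio: {ratio:.1f}x")
```

Read this way, an 87× perplexity reduction amounts to roughly 4.5 nats of average loss per token before any distillation has been done. It says nothing on its own about how much of that gap remains after fine-tuning.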

Sentiment

Posts in this cluster describe modularity-aware pretraining as a promising compression-aware co-design direction for MoE distillation and thank Sewon Min for introducing the work at ICLR.

Positive: 100.0% · Negative: 0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 81 · Likes: 3 · Replies: 1

Junhyuck Kim (@jhyuckkim)

We see such compression-aware pretraining as a co-design direction worth exploring.

Thanks to @sewon__min for introducing the work at ICLR, and to the authors @RyanYixiang @AkshitaB93 for the great work!

12h ago · 81 views · 3 likes

Junhyuck Kim (@jhyuckkim)

Please check out the paper for more details 🙂

Code: http://github.com/krafton-ai/moe-to-dense
Paper: http://arxiv.org/abs/2605.28207

This is joint work with amazing collaborators: @JihunYun_ai, Haechan, @gmkim_ai, Joonghyun, @jaewoong_cho from @Krafton_AI

12h ago · 78 views · 2 likes

Suresh (@_Suresh2)

@jhyuckkim does the 87x pre-distillation PPL gap actually survive fine-tuning? i've had big ppl drops vanish after one epoch

6h ago · 18 views · 1 like