Researchers Ask If MoE Can Yield Dense Models Without Full Training

Original postKangwook Lee#1585
Junhyuck Kim@jhyuckkim

Almost all "flagship" models are now MoEs.

But smaller models still prefer to be dense as they target memory-constrained scenarios where total params matter.

So we ask: Can we leverage an MoE to produce dense models without having to train them from scratch? 🧵👇

6:41 AM · Jun 9, 2026 · 1.6K Views
Junhyuck Kim@jhyuckkim

8/ Finally, a complementary direction: we tested compatibility with modularity-aware pretraining (EMO, https://arxiv.org/abs/2605.06663).

Modularity-aware pretraining gives a +3.6pp lift and ~87× lower pre-distillation PPL on the dense student model vs. a regular-MoE teacher.

13h · Views 557 · Likes 4 · Bookmarks 1
Junhyuck Kim@jhyuckkim

We see such compression-aware pretraining as a co-design direction worth exploring.

Thanks to @sewon__min for introducing the work at ICLR, and to the authors @RyanYixiang @AkshitaB93 for the great work!

13h · Views 81 · Likes 3
Junhyuck Kim@jhyuckkim

2/ The structure of MoE makes this natural.

Per-expert computations are independent until the weighted sum. Concatenating their weights into a dense FFN preserves intermediate activations.

The problem comes down to which experts give the best dense FFN init for distillation.

13h · Views 94 · Likes 1
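The concatenation claim in 2/ can be checked on a toy example. The sketch below is an illustration, not the authors' code: it assumes simple two-layer ReLU experts (real MoE experts are often gated, e.g. SwiGLU), and every name in it (`w_up`, `w_down`, `concat_to_dense`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_expert, n_experts = 16, 8, 4

# Hypothetical toy experts: each one is a two-layer ReLU FFN.
w_up = [rng.normal(size=(d_model, d_expert)) for _ in range(n_experts)]
w_down = [rng.normal(size=(d_expert, d_model)) for _ in range(n_experts)]

def expert_forward(x, i):
    """Output of expert i on its own, before any router weighting."""
    return np.maximum(x @ w_up[i], 0.0) @ w_down[i]

def concat_to_dense(selected):
    """Stack the selected experts side by side into one wider FFN."""
    up = np.concatenate([w_up[i] for i in selected], axis=1)      # (d_model, k * d_expert)
    down = np.concatenate([w_down[i] for i in selected], axis=0)  # (k * d_expert, d_model)
    return up, down

def dense_forward(x, up, down):
    return np.maximum(x @ up, 0.0) @ down

x = rng.normal(size=(5, d_model))
selected = [0, 2]
up, down = concat_to_dense(selected)

dense_hidden = np.maximum(x @ up, 0.0)
dense_out = dense_forward(x, up, down)
unweighted_sum = sum(expert_forward(x, i) for i in selected)
```

Here the dense hidden activations are exactly the selected experts' activations laid side by side, and the dense output equals the unweighted sum of those experts' outputs. The router's per-token weights are not reproduced by this construction; how to compensate for them is presumably what the thread's "magnitude scaling" choice covers.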
Junhyuck Kim@jhyuckkim

5/ Across 350 configurations on Qwen3-30B-A3B, a clear pattern emerged:

diversity-aware selection (DO-ACP) with no merging (pure pruning) consistently wins after distillation.

The pattern holds on DeepSeek and GPT-OSS MoE models too.

13h · Views 60 · Likes 1
Junhyuck Kim@jhyuckkim

4/ Our intuition is that output diversity should matter for selection.

Drawing inspiration from the D-Optimal criterion in experimental design, we introduce a diversity-aware scoring metric (DO-ACP) and compare with other expert scoring metrics.

13h · Views 55 · Likes 1
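The thread does not define DO-ACP, so the following is only a generic illustration of the D-optimal idea from 4/, not the paper's metric. It greedily picks the rows of a feature matrix that maximize the log-determinant of their Gram matrix, which favors experts whose outputs point in different directions. The `features` matrix (one summary vector per expert) and the function name are assumptions made for the sketch.

```python
import numpy as np

def greedy_d_optimal(features, k, eps=1e-6):
    """Greedily choose k rows maximizing log det(F_S @ F_S.T + eps * I).

    A generic D-optimal-style heuristic; NOT the DO-ACP metric from the paper.
    """
    selected = []
    remaining = list(range(len(features)))
    for _ in range(k):
        best_i, best_val = None, -np.inf
        for i in remaining:
            f = features[selected + [i]]
            gram = f @ f.T + eps * np.eye(len(f))
            _, logdet = np.linalg.slogdet(gram)
            if logdet > best_val:
                best_i, best_val = i, logdet
        selected.append(best_i)
        remaining.remove(best_i)
    return selected

# Hypothetical per-expert output summaries: rows 0 and 1 are near-duplicates.
features = np.array([
    [2.0, 0.0, 0.0],
    [1.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
picked = greedy_d_optimal(features, 3)
```

A norm-only ranking would keep both rows 0 and 1 because they are the largest. The log-determinant objective skips row 1, since it adds almost no new direction once row 0 is selected.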
Junhyuck Kim@jhyuckkim

We set up a pipeline that decouples these choices [number of experts / scoring / grouping / magnitude scaling] for systematic investigation.

13h · Views 54 · Likes 1
Junhyuck Kim@jhyuckkim

7/ Benchmark numbers alone don’t tell the whole story, so we also conducted qualitative analysis.

MoE→dense (DO-ACP) wins over dense→dense (D2D) on two fronts: it is more often fluent and gets more facts right.

More details and examples in the paper!

13h · Views 61
Junhyuck Kim@jhyuckkim

3/ The design space is wider than it looks.

E.g., from 128 experts, we can pick 8 and concatenate, or pick 32 and merge them into 8 groups of 4, etc. Both scoring and grouping metrics have multiple candidates from the expert-pruning/merging literature.

13h · Views 61
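The two options in 3/ (pick 8 and concatenate, or pick 32 and merge into 8 groups of 4) produce a dense FFN of the same width. The sketch below only illustrates that shape arithmetic under stated assumptions: the scores are random placeholders, grouping is by consecutive rank, and merging is a plain weight average. The thread says the actual scoring and grouping metrics come from the expert-pruning/merging literature, and none of the names here are from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, d_expert = 128, 8, 4

# Hypothetical toy expert weights and placeholder importance scores.
w_up = rng.normal(size=(n_experts, d_model, d_expert))
w_down = rng.normal(size=(n_experts, d_expert, d_model))
scores = rng.random(n_experts)

def top_experts(k):
    """Indices of the k highest-scoring experts, best first."""
    return np.argsort(scores)[::-1][:k]

def prune_and_concat(k):
    """Keep k experts as they are and concatenate them (pure pruning)."""
    idx = top_experts(k)
    up = np.concatenate(list(w_up[idx]), axis=1)
    down = np.concatenate(list(w_down[idx]), axis=0)
    return up, down

def merge_and_concat(k, group_size):
    """Keep k experts, average each group of `group_size`, then concatenate."""
    idx = top_experts(k).reshape(-1, group_size)  # (n_groups, group_size)
    merged_up = w_up[idx].mean(axis=1)            # (n_groups, d_model, d_expert)
    merged_down = w_down[idx].mean(axis=1)        # (n_groups, d_expert, d_model)
    up = np.concatenate(list(merged_up), axis=1)
    down = np.concatenate(list(merged_down), axis=0)
    return up, down

pruned_up, pruned_down = prune_and_concat(8)
merged_up, merged_down = merge_and_concat(32, 4)
```

Both routes end with a hidden width of 8 × `d_expert`, so the dense student has the same parameter count either way. They differ only in which weights initialize it.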
Junhyuck Kim@jhyuckkim

6/ How does our best MoE→dense recipe compare to just pruning a dense model directly?

Surprisingly, at matched total params for teacher and student, our MoE→dense (DO-ACP) outperforms dense→dense (D2D) pruning by +6.3pp avg accuracy at ~1.6× faster training wall-clock.

13h · Views 60
Junhyuck Kim@jhyuckkim

Please check out the paper for more details 🙂

Code: http://github.com/krafton-ai/moe-to-dense Paper: http://arxiv.org/abs/2605.28207

This is joint work with amazing collaborators: @JihunYun_ai, Haechan, @gmkim_ai, Joonghyun, @jaewoong_cho from @Krafton_AI

13h · Views 78 · Likes 2
Suresh@_Suresh2

@jhyuckkim does the 87x pre-distillation PPL gap actually survive fine-tuning? i've had big ppl drops vanish after one epoch

8h · Views 18 · Likes 1