Researchers Ask If MoE Can Yield Dense Models Without Full Training

Kangwook Lee#1729

Junhyuck Kim@jhyuckkim

Almost all "flagship" models are now MoEs.

But smaller models still prefer to be dense as they target memory-constrained scenarios where total params matter.

So we ask: Can we leverage an MoE to produce dense models without having to train them from scratch? 🧵👇

6:41 AM · Jun 9, 2026 · 2.5K Views

Junhyuck Kim@jhyuckkim

8/ Finally, a complementary direction: we tested compatibility with modularity-aware pretraining (EMO, https://arxiv.org/abs/2605.06663).

Modularity-aware pretraining gives a +3.6pp lift and ~87× lower pre-distillation PPL on the dense student model vs. a regular-MoE teacher.

Junhyuck Kim@jhyuckkim

We see such compression-aware pretraining as a co-design direction worth exploring.

Thanks to @sewon__min for introducing the work at ICLR, and to the authors @RyanYixiang @AkshitaB93 for the great work!

Junhyuck Kim@jhyuckkim

2/ The structure of MoE makes this natural.

Per-expert computations are independent until the weighted sum. Concatenating their weights into a dense FFN preserves intermediate activations.

The problem comes down to which experts give the best dense FFN init for distillation.
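The concatenation claim can be checked on toy weights. The sketch below is only an illustration, not the authors' code: it assumes a plain ReLU FFN per expert (real MoE layers typically use gated activations), leaves out router weights entirely, and uses made-up sizes and the hypothetical helper `concat_experts`.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(h):
    return [max(0.0, v) for v in h]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def concat_experts(experts):
    """Build one dense FFN from (W_up, W_down) expert pairs (hypothetical helper).

    W_up matrices are stacked along the hidden dimension (rows);
    W_down matrices are joined along that same dimension (columns).
    """
    W_up = [row for up, _ in experts for row in up]
    d_model = len(experts[0][1])
    W_down = [[v for _, down in experts for v in down[r]] for r in range(d_model)]
    return W_up, W_down

rng = random.Random(0)
d_model, d_expert, n_selected = 4, 3, 2  # toy sizes, chosen for illustration
experts = [
    (random_matrix(d_expert, d_model, rng), random_matrix(d_model, d_expert, rng))
    for _ in range(n_selected)
]
x = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]

# Per-expert path: each expert is computed independently, then summed
# (unweighted here, because routing weights are not modeled).
expert_hidden = [relu(matvec(up, x)) for up, _ in experts]
expert_out = [matvec(down, h) for (_, down), h in zip(experts, expert_hidden)]
summed_out = [sum(vals) for vals in zip(*expert_out)]

# Dense path: a single FFN built from the concatenated weights.
W_up, W_down = concat_experts(experts)
dense_hidden = relu(matvec(W_up, x))
dense_out = matvec(W_down, dense_hidden)

# The dense hidden vector is the per-expert hidden vectors laid end to end.
hidden_matches = dense_hidden == [v for h in expert_hidden for v in h]
output_matches = all(abs(a - b) < 1e-9 for a, b in zip(dense_out, summed_out))
print(hidden_matches, output_matches)
```

Because the dense output equals the unweighted sum of the selected experts, it will generally differ from the routed MoE output. That gap is presumably what distillation then has to close.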

Junhyuck Kim@jhyuckkim

5/ Across 350 configurations on Qwen3-30B-A3B, a clear pattern emerged:

diversity-aware selection (DO-ACP) with no merging (pure pruning) consistently wins after distillation.

The pattern holds on DeepSeek and GPT-OSS MoE models too.

Junhyuck Kim@jhyuckkim

4/ Our intuition is that output diversity should matter for selection.

Drawing inspiration from the D-Optimal criterion in experimental design, we introduce a diversity-aware scoring metric (DO-ACP) and compare with other expert scoring metrics.
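For readers unfamiliar with the D-optimal criterion: it favors a set of design points whose information (Gram) matrix has a large determinant, which rewards points that span different directions. The sketch below is a generic greedy version of that idea and should not be read as the DO-ACP metric, whose definition is in the paper. The per-expert feature vectors are hypothetical stand-ins (for example, some summary of each expert's outputs on calibration data), and `greedy_d_optimal` and `top_k_by_norm` are illustrative names.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def greedy_d_optimal(features, k, eps=1e-12):
    """Greedily pick up to k vectors whose Gram matrix has a large determinant.

    Adding a vector multiplies the Gram determinant by the squared norm of
    its residual after projection onto the span of the vectors chosen so far,
    so each step simply takes the candidate with the largest residual.
    """
    residuals = [list(f) for f in features]
    chosen = []
    for _ in range(k):
        candidates = [i for i in range(len(residuals)) if i not in chosen]
        if not candidates:
            break
        best = max(candidates, key=lambda i: dot(residuals[i], residuals[i]))
        norm_sq = dot(residuals[best], residuals[best])
        if norm_sq <= eps:
            break  # remaining candidates add no new direction
        chosen.append(best)
        q = [v / norm_sq ** 0.5 for v in residuals[best]]
        # Remove the newly covered direction from every residual.
        for i, r in enumerate(residuals):
            coeff = dot(r, q)
            residuals[i] = [v - coeff * qv for v, qv in zip(r, q)]
    return chosen

def top_k_by_norm(features, k):
    """Baseline: rank by magnitude alone, ignoring redundancy."""
    order = sorted(range(len(features)), key=lambda i: -dot(features[i], features[i]))
    return sorted(order[:k])

# Hypothetical features: expert 1 is nearly a copy of expert 0.
features = [
    [2.0, 0.0, 0.0],
    [1.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.5],
]
diverse = greedy_d_optimal(features, 2)
by_norm = top_k_by_norm(features, 2)
print(diverse, by_norm)
```

On this toy input the magnitude-only baseline keeps both near-duplicate experts, while the determinant-based rule skips the redundant one. That is the kind of behavior a diversity-aware score is meant to encourage.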

Junhyuck Kim@jhyuckkim

We set up a pipeline that decouples these choices [number of experts / scoring / grouping / magnitude scaling] for systematic investigation.

Junhyuck Kim@jhyuckkim

7/ Benchmark numbers alone don’t tell the whole story, so we also conducted qualitative analysis.

MoE→dense (DO-ACP) wins over dense→dense (D2D) on two fronts: it is more often fluent and gets more facts right.

More details and examples in the paper!

Junhyuck Kim@jhyuckkim

3/ The design space is wider than it looks.

E.g., from 128 experts, we can pick 8 and concatenate, or pick 32 and merge them into 8 groups of 4, etc. Both scoring and grouping metrics have multiple candidates from the expert-pruning/merging literature.
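The two options in that example end up with the same dense FFN width, which a few lines of arithmetic make explicit. This is a rough sketch under stated assumptions: merging is done by uniform weight averaging and groups are formed from contiguous ranks, both of which are placeholders for the grouping and merging methods the thread refers to. The expert width and all function names are hypothetical.

```python
def chunk(ids, group_size):
    """Split a ranked list of expert ids into consecutive groups."""
    return [ids[i:i + group_size] for i in range(0, len(ids), group_size)]

def merge_group(weight_mats):
    """Uniformly average same-shaped matrices (one simple merging choice)."""
    n = len(weight_mats)
    return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*weight_mats)]

def dense_hidden_width(n_selected, group_size, d_expert):
    """Hidden width of the dense FFN after merging groups and concatenating."""
    if n_selected % group_size != 0:
        raise ValueError("n_selected must be divisible by group_size")
    return (n_selected // group_size) * d_expert

d_expert = 4  # toy per-expert hidden width, not taken from any real model

# Option A: pick 8 experts and concatenate them directly (groups of 1).
width_prune = dense_hidden_width(8, 1, d_expert)

# Option B: pick 32 experts, merge them into 8 groups of 4, then concatenate.
groups = chunk(list(range(32)), 4)
width_merge = dense_hidden_width(32, 4, d_expert)

# Averaging two 1x2 matrices, as a tiny merging example.
merged = merge_group([[[1.0, 2.0]], [[3.0, 4.0]]])
print(width_prune, width_merge, len(groups), merged)
```

Since both routes give a student of identical size, the comparison between pruning and merging isolates how the initial weights are built, not how many parameters the student has.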

Junhyuck Kim@jhyuckkim

6/ How does our best MoE→dense recipe compare to just pruning a dense model directly?

Surprisingly, at matched total params for teacher and student, our MoE→dense (DO-ACP) outperforms dense→dense (D2D) pruning by +6.3pp avg accuracy at ~1.6× faster training wall-clock.

Junhyuck Kim@jhyuckkim

Please check out the paper for more details 🙂

Code: http://github.com/krafton-ai/moe-to-dense
Paper: http://arxiv.org/abs/2605.28207

This is joint work with amazing collaborators: @JihunYun_ai, Haechan, @gmkim_ai, Joonghyun, @jaewoong_cho from @Krafton_AI

Suresh@_Suresh2

@jhyuckkim does the 87x pre-distillation PPL gap actually survive fine-tuning? i've had big ppl drops vanish after one epoch
