
Modularity-Aware Pretraining Delivers 3.6pp Lift for Dense Student Models

Original post · Sewon Min · #188
Junhyuck Kim (@jhyuckkim)

8/ Finally, a complementary direction: we tested compatibility with modularity-aware pretraining (EMO, https://arxiv.org/abs/2605.06663).

Modularity-aware pretraining gives a +3.6pp (percentage point) lift and ~87× lower pre-distillation perplexity (PPL) on the dense student model vs. a regular-MoE teacher.

6:41 AM · Jun 9, 2026 · 424 Views