Almost all "flagship" models are now MoEs.
But smaller models are still mostly dense, since they target memory-constrained scenarios where total params matter.
So we ask: Can we leverage an MoE to produce dense models without having to train them from scratch? 🧵👇

