Junhyuck Kim finds modularity-aware pretraining improves accuracy by 3.6 percentage points when distilling MoE models into dense student models · Digg