@teortaxesTex tbh i think it's likely that strong-to-weak distillation is just significantly simpler and better than OPD et al
There are two hypotheses for DeepSeek-V4's strange performance (as in, V4-Flash is about as good as we expected, but V4-Pro is disappointing given its scale): 1) a failed pretrain; 2) a big difference in the RL/MOPD stage. Flash probably got multiple such iterations.