Quite a jump from vanilla OPD. Here the authors use the difference between a privileged teacher and a privileged student to compute a token-level advantage, which is then used to switch between weak and strong distillation modes.
🔗https://arxiv.org/abs/2606.30626
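
To make the idea concrete, here is a rough sketch of how a token-level advantage could gate the distillation mode. It is not the paper's formulation. The advantage definition (teacher log-prob minus student log-prob on the sampled token), the threshold rule, and every name and default value (`token_advantages`, `distillation_weights`, `threshold`, `weak`, `strong`) are assumptions made for illustration only.

```python
from typing import List


def token_advantages(teacher_logps: List[float], student_logps: List[float]) -> List[float]:
    """Per-token advantage, assumed here to be teacher log-prob minus student log-prob."""
    if len(teacher_logps) != len(student_logps):
        raise ValueError("teacher and student sequences must have the same length")
    return [t - s for t, s in zip(teacher_logps, student_logps)]


def distillation_weights(
    advantages: List[float],
    threshold: float = 0.5,  # hypothetical switching point
    weak: float = 0.1,       # hypothetical loss weight in "weak" mode
    strong: float = 1.0,     # hypothetical loss weight in "strong" mode
) -> List[float]:
    """Pick a per-token loss weight: strong mode where |advantage| exceeds the threshold."""
    return [strong if abs(a) > threshold else weak for a in advantages]


# Toy log-probs of the sampled tokens under each model (made-up numbers).
teacher = [-0.2, -1.5, -0.3]
student = [-0.25, -0.4, -2.0]

weights = distillation_weights(token_advantages(teacher, student))
print(weights)
```

In this toy version the first token, where the two models nearly agree, gets the weak weight, and the other two get the strong one. The paper's actual switching rule may look quite different.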


