“Trust Region On-Policy Distillation”
On-policy distillation is powerful, but a single bad mismatch between student and teacher can degrade the gradients.
So this paper's TrOPD learns only where the teacher is reliable, treats outliers separately, and nudges the student back onto teacher-like reasoning paths.
The result is much cleaner gradients and more stable long-CoT distillation.
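
To make the "only learn where the teacher is reliable" idea concrete, here is a rough sketch of a per-token distillation loss with a trust-region gate. This is not the paper's actual objective: the reverse-KL loss, the threshold `tau`, and the rule of capping outlier tokens at `tau` are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

def reverse_kl(student, teacher, eps=1e-12):
    """KL(student || teacher) for one token's distribution over the vocabulary."""
    return sum(
        s * math.log((s + eps) / (t + eps))
        for s, t in zip(student, teacher)
        if s > 0
    )

def trust_region_distill_loss(student_seq, teacher_seq, tau=0.5):
    """Mean per-token loss with a hypothetical trust-region gate.

    Assumption (not from the paper): a token is 'trusted' when its
    student/teacher divergence is at most tau. Outlier tokens contribute
    the constant tau instead, so in an autodiff framework they would
    carry no gradient.
    """
    kls = [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    trusted = [kl <= tau for kl in kls]
    per_token = [kl if ok else tau for kl, ok in zip(kls, trusted)]
    return sum(per_token) / len(per_token), trusted

# Two tokens over a two-word vocabulary: the first is a mild disagreement,
# the second is a near-total mismatch that would otherwise dominate the loss.
student_seq = [[0.5, 0.5], [0.99, 0.01]]
teacher_seq = [[0.6, 0.4], [0.01, 0.99]]
loss, trusted = trust_region_distill_loss(student_seq, teacher_seq, tau=0.5)
print(trusted)
```

In this toy case the second token's divergence is far larger than the first, so without the gate it would dominate the average; with the gate, only the first token drives learning.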

