KAIST's Kimin Lee and Woogyeol Jin introduce EOPD, using forward KL at high-entropy tokens to improve reasoning model distillation
The added term targets the mode-seeking failures of reverse KL at critical reasoning decision points.
On-Policy Distillation (OPD) can suffer from mode-seeking behavior due to its reverse KL objective. In our recent work, we address this by augmenting OPD with a forward KL term.
Please check out @wg_jin02's post for more details!
🚀 On-Policy Distillation (OPD) has gained attention for its efficiency over RLVR, thanks to its dense supervision signal. While reverse KL-based OPD effectively captures the teacher's dominant modes, it has a limitation.

What's the problem? In reasoning tasks, high-entropy tokens, where the teacher hesitates, mark decision points where multiple valid reasoning paths diverge. OPD fails to transfer the teacher's distribution effectively at these positions.

✨ We introduce EOPD (Entropy-Aware On-Policy Distillation), which addresses this by augmenting OPD with a forward KL term on high-entropy tokens.

📄 Paper: https://arxiv.org/abs/2603.07079
💻 Code: http://github.com/WLS04/EOPD
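
To make the idea concrete, here is a minimal per-token sketch in plain Python. It is not the authors' implementation: the hard entropy gate, the names `entropy_threshold` and `fkl_weight`, and their default values are all assumptions for illustration. The paper and repo may select high-entropy tokens and weight the two terms differently.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's vocabulary logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    # Shannon entropy in nats.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q, eps=1e-12):
    # KL(p || q); eps guards against log of zero in q.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def eopd_token_loss(teacher_logits, student_logits,
                    entropy_threshold=0.5, fkl_weight=1.0):
    # Hypothetical gate and weight, chosen only for this sketch.
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    # Reverse KL (student || teacher): the usual OPD term, mode-seeking.
    loss = kl(p_student, p_teacher)
    # Forward KL (teacher || student): added only where the teacher is uncertain,
    # so the student is pushed to cover all of the teacher's plausible tokens.
    if entropy(p_teacher) > entropy_threshold:
        loss += fkl_weight * kl(p_teacher, p_student)
    return loss

# A teacher split between two tokens, and a student collapsed onto one of them.
two_mode_teacher = [0.0, 0.0, -10.0]
collapsed_student = [5.0, -5.0, -5.0]
reverse_only = kl(softmax(collapsed_student), softmax(two_mode_teacher))
gated = eopd_token_loss(two_mode_teacher, collapsed_student)
print(f"reverse KL only: {reverse_only:.3f}, with forward KL: {gated:.3f}")
```

In this toy case the reverse KL stays small even though the student ignores one of the teacher's two modes, while the forward KL term is much larger. That gap is the mode-seeking problem the post describes.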