16h ago

KAIST's Kimin Lee and Woogyeol Jin introduce EOPD, using forward KL at high-entropy tokens to improve reasoning model distillation

The added forward KL term targets the mode-seeking behavior of reverse KL at decision points where multiple valid reasoning paths diverge.

Original post

🚀 On-Policy Distillation (OPD) has gained attention for its efficiency over RLVR, thanks to its dense supervision signal. While reverse KL-based OPD effectively captures the teacher's dominant modes, it has a limitation.

What's the problem? In reasoning tasks, high-entropy tokens, where the teacher hesitates, mark decision points where multiple valid reasoning paths diverge. OPD fails to transfer the teacher effectively at these positions.

✨ We introduce EOPD (Entropy-Aware On-Policy Distillation), which addresses this by augmenting OPD with a forward KL term on high-entropy tokens.

📄 Paper: https://arxiv.org/abs/2603.07079
💻 Code: http://github.com/WLS04/EOPD
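The post does not spell out the exact objective, so the following is only a minimal sketch of the idea it describes: reverse KL at every token, plus a forward KL term at tokens where the teacher's entropy is high. The function name `entropy_aware_loss`, the threshold `tau`, and the weight `lam` are hypothetical and are not taken from the paper or the EOPD repository. The token positions are assumed to come from student-sampled sequences, as in on-policy distillation.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def kl(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors over the same vocabulary."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0.0)

def entropy_aware_loss(teacher, student, tau=0.5, lam=1.0):
    """Mean per-token loss: reverse KL everywhere, plus forward KL
    where the teacher's entropy exceeds tau.

    teacher, student: lists of per-token probability vectors.
    tau, lam: hypothetical threshold and weight (assumptions, not from the paper).
    """
    total = 0.0
    for p_t, p_s in zip(teacher, student):
        loss = kl(p_s, p_t)              # reverse KL: KL(student || teacher)
        if entropy(p_t) > tau:           # teacher is uncertain at this token
            loss += lam * kl(p_t, p_s)   # forward KL: KL(teacher || student)
        total += loss
    return total / len(teacher)

# Toy example over a 3-token vocabulary.
teacher = [[0.98, 0.01, 0.01],   # confident token (low entropy)
           [0.50, 0.45, 0.05]]   # decision point: two plausible continuations
student = [[0.90, 0.05, 0.05],
           [0.95, 0.03, 0.02]]   # student has collapsed onto one continuation

print(entropy_aware_loss(teacher, student, lam=0.0))  # reverse KL only
print(entropy_aware_loss(teacher, student, lam=1.0))  # with the forward KL term
```

In this toy case only the second token crosses the entropy threshold, so the forward KL term adds a penalty exactly where the student has dropped one of the teacher's two plausible continuations.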

12:24 AM · May 26, 2026

On-Policy Distillation (OPD) can suffer from mode-seeking behavior due to the reverse KL objective. In our recent work, we address this by augmenting OPD with a forward KL term.
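For reference, the two divergences can be written in their standard form. The notation here is an assumption for illustration, not taken from the paper: $\pi_T$ is the teacher, $\pi_\theta$ the student, $x$ the context so far, and $y$ ranges over next tokens.

```latex
% Reverse KL: expectation under the student (mode-seeking)
D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_T\right)
  = \sum_{y} \pi_\theta(y \mid x)\,
    \log \frac{\pi_\theta(y \mid x)}{\pi_T(y \mid x)}

% Forward KL: expectation under the teacher (mass-covering)
D_{\mathrm{KL}}\!\left(\pi_T \,\|\, \pi_\theta\right)
  = \sum_{y} \pi_T(y \mid x)\,
    \log \frac{\pi_T(y \mid x)}{\pi_\theta(y \mid x)}
```

Because the reverse KL is weighted by the student's own probabilities, tokens the student assigns almost no mass contribute little to the loss, so the student can ignore a valid teacher mode at low cost. The forward KL is weighted by the teacher, so it penalizes the student wherever the teacher places mass that the student does not.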

Please check out @wg_jin02's post for more details!

Woogyeol Jin (@wg_jin02)


7:24 AM · May 26, 2026 · 15.5K Views
8:14 AM · May 26, 2026 · 11.3K Views