In our new paper, EDGE-OPD (EviDence GuidEd On-Policy Distillation), we introduce two key ideas for improving On-Policy Distillation (OPD).
First, we use guided rollouts: for a subset of the student’s prompts, we inject privileged context (such as hints, hidden evidence, or rare personas) during generation, so that rare target behaviors actually appear in the on-policy data. That privileged information is then masked during loss computation.
Second, we introduce an evidence mask that updates the student only on token positions supported by the privileged context, instead of learning from the entire rollout indiscriminately. We show empirically that, together, these mechanisms enable the student to internalize rare but correct behaviors without directly training on the hidden context. We also analyze how these rare-token signals localize within specific regions of the rollout, offering new insights into efficient knowledge transfer and the preservation of general-purpose capabilities during post-training.
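To make the two mechanisms concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names (`build_prompts`, `token_reverse_kl`, `evidence_masked_loss`) are hypothetical, the per-token reverse KL is assumed as the distillation objective because it is a common choice in on-policy distillation, and the evidence mask is taken as a given 0/1 list since how it is computed is described in the paper rather than here.

```python
import math
import random


def build_prompts(prompt, privileged, guide_prob, rng):
    """Return (generation_prompt, training_prompt) for one rollout.

    Assumption: a guided rollout prepends the privileged context at
    generation time only; the prompt used for the loss never contains it.
    """
    guided = rng.random() < guide_prob
    generation_prompt = f"{privileged}\n{prompt}" if guided else prompt
    return generation_prompt, prompt


def token_reverse_kl(student_logp, teacher_logp):
    """KL(student || teacher) over the vocabulary at one token position.

    Both arguments are lists of log-probabilities over the same vocabulary.
    """
    return sum(math.exp(s) * (s - t) for s, t in zip(student_logp, teacher_logp))


def evidence_masked_loss(student_logps, teacher_logps, evidence_mask):
    """Average the per-token divergence over positions where the mask is 1.

    Positions with mask 0 contribute nothing, so the student is only
    updated where the (hypothetical) evidence mask marks support.
    """
    total = 0.0
    count = 0
    for s, t, m in zip(student_logps, teacher_logps, evidence_mask):
        if m:
            total += token_reverse_kl(s, t)
            count += 1
    return total / count if count else 0.0
```

In this sketch, calling `build_prompts(prompt, hint, 0.25, random.Random(0))` would guide roughly a quarter of rollouts, and `evidence_masked_loss` would then be applied to the rollout tokens scored against the plain prompt, so the hint shapes what is generated but never appears in the training input.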
Read the blog post and paper here: https://www.edgerunnerai.com/news/edge-opd-internalizing-private-data-with-on-policy-distillation
