"Industrial-scale distillation" exposes proprietary models to capability extraction via APIs. Recent work asks how to prevent this at inference time while keeping utility. But evaluations assume a passive attacker. We show against a adaptive attacker, the defense disappears. 馃У
Check out our new paper on understanding the distillation game between model providers (who want to hide a capability) and adaptive students.
@stai_research @StanfordAILab
"Industrial-scale distillation" exposes proprietary models to capability extraction via APIs. Recent work asks how to prevent this at inference time while keeping utility. But evaluations assume a passive attacker. We show against a adaptive attacker, the defense disappears. 馃У
When a model exposes rich reasoning (like chain-of-thought), it risks leaking the exact steps needed to solve complex problems. Recent "antidistillation" methods intervene during generation to prevent this extraction, tweaking the outputs to hide the most useful training signals.
"Industrial-scale distillation" exposes proprietary models to capability extraction via APIs. Recent work asks how to prevent this at inference time while keeping utility. But evaluations assume a passive attacker. We show against a adaptive attacker, the defense disappears. 馃У
Bottom line: antidistillation defenses must be:
1) evaluated against adaptive, not passive, students, 2) reliable beyond just accuracy, 3) computationally efficient
Check out the full paper and our codebase here:
https://arxiv.org/abs/2605.22737
With @ys_alh @rzshokri @sanmikoyejo
PoE performs competitively against SOTA defenses like ADS but requires significantly less compute. On GSM8K, PoE incurs only a 1.6× overhead (compared to 2.9× for prior methods) while preserving higher-quality reasoning.
But in practice, an attacker isn't passive. They are adaptive. They will score, filter, and reweight the generated examples to focus selectively on the data that provides the highest learning value. Evaluating defenses without accounting for this gives a false sense of security.
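To make "score, filter, and reweight" concrete, here is a minimal sketch of one way an adaptive student could turn per-example scores into training weights. Everything here is an illustrative assumption: the function name `adaptive_weights`, the `keep_fraction` and `temperature` parameters, and the idea that each example already comes with a scalar "learning value" score. It is not the evaluation procedure from the paper.

```python
import math

def adaptive_weights(scores, keep_fraction=0.5, temperature=1.0):
    """Hypothetical adaptive-student weighting.

    scores: one assumed "learning value" score per generated example.
    Returns one weight per example; filtered-out examples get 0.0 and
    the remaining weights sum to 1.
    """
    if not scores:
        return []
    # Filter: keep only the top-scoring fraction of examples.
    n_keep = max(1, math.ceil(keep_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = ranked[:n_keep]
    # Reweight: softmax over the kept scores (shifted for numerical stability).
    top = max(scores[i] for i in kept)
    unnorm = {i: math.exp((scores[i] - top) / temperature) for i in kept}
    total = sum(unnorm.values())
    return [unnorm.get(i, 0.0) / total for i in range(len(scores))]
```

A passive student corresponds to uniform weights over all examples; the adaptive student above concentrates its training signal on the examples it scores highest.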
Evaluating against ADS (Savani et al.) reveals a massive "passive-adaptive gap." On GSM8K and MATH, we demonstrate that adaptive students recover ~50% more capability against state-of-the-art defenses than passive evaluations suggest. Savani et al.: https://arxiv.org/abs/2504.13146
We formalize this dynamic as a minimax game between a utility-constrained teacher and an adaptive student. This framework provides two things: 1) an adaptive evaluation strategy for the student, 2) a principled foundation for deriving new defense strategies for the teacher.
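As a rough schematic of what such a game can look like, in illustrative notation only (the paper's exact objective and constraints may differ): the teacher picks an output distribution $q$ whose utility $U(q)$ stays above a floor $\tau$, the student picks a reweighting $w$ of the sampled outputs, and $\mathrm{Cap}$ measures the capability of the student $\theta_S$ trained on the reweighted data. All of these symbols are assumptions introduced here.

```latex
\min_{q \,:\, U(q) \ge \tau} \;\; \max_{w \in \mathcal{W}} \;\; \mathrm{Cap}\bigl(\theta_S(q, w)\bigr)
```

Under this reading, a passive evaluation fixes $w$ to uniform weights, so it can only underestimate the inner maximum.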
Beyond evaluation, we show our framework can be used to obtain new defense strategies. We derive Product-of-Experts (PoE), a forward-pass-only defense that combines the teacher with a proxy student. It suppresses outputs with a large teacher-over-student likelihood gap.
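A minimal sketch of the general idea, assuming a simple geometric mixture of the two next-token distributions: each token's teacher log-probability is penalized by `lam` times its teacher-over-proxy log-likelihood gap, which is equivalent to sampling from `p_teacher^(1 - lam) * p_proxy^lam`. The function names, the `lam` parameter, and this particular combination rule are assumptions for illustration; the exact PoE formulation is in the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def poe_next_token_probs(teacher_logits, proxy_logits, lam=0.5):
    """Hypothetical product-of-experts combination for one decoding step.

    lam = 0.0 recovers the teacher distribution; larger lam suppresses
    tokens the teacher likes much more than the proxy student does.
    """
    log_pt = log_softmax(teacher_logits)
    log_ps = log_softmax(proxy_logits)
    # Penalize each token by lam times its teacher-over-proxy log gap.
    combined = [lt - lam * (lt - ls) for lt, ls in zip(log_pt, log_ps)]
    # Renormalize into a proper probability distribution.
    return [math.exp(x) for x in log_softmax(combined)]
```

Only forward passes of the two models are needed to produce `teacher_logits` and `proxy_logits`, which is the property the sketch is meant to illustrate.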
If you want to learn more about this line of work, also check out the blog post (https://antidistillation.com/) by @ashertrockman and @yashsavani_ on the motivations and challenges of designing defenses against distillation attacks.