Blog Explores Defenses Against AI Model Distillation Attacks

Check out our new paper on understanding the distillation game between model providers (who want to hide a capability) and adaptive students.

@stai_research @StanfordAILab

Mahdi Haghifam@HaghifamMahdi

"Industrial-scale distillation" exposes proprietary models to capability extraction via APIs. Recent work asks how to prevent this at inference time while keeping utility. But evaluations assume a passive attacker. We show that against an adaptive attacker, the defense disappears. 🧵

Mahdi Haghifam@HaghifamMahdi

When a model exposes rich reasoning (like chain-of-thought), it risks leaking the exact steps needed to solve complex problems. Recent "antidistillation" methods intervene during generation to prevent this extraction, tweaking the outputs to hide the most useful training signals.

Mahdi Haghifam@HaghifamMahdi

Bottom line: antidistillation defenses must be:

1) evaluated against adaptive, not passive, students, 2) reliable beyond just accuracy, 3) computationally efficient

Check out the full paper and our codebase here:

https://arxiv.org/abs/2605.22737

With @ys_alh @rzshokri @sanmikoyejo

Mahdi Haghifam@HaghifamMahdi

PoE performs competitively against SOTA defenses like ADS but requires significantly less compute. On GSM8K, PoE incurs only a 1.6× overhead (compared to 2.9× for prior methods) while preserving higher-quality reasoning.

Mahdi Haghifam@HaghifamMahdi

But in practice, an attacker isn't passive. They are adaptive. They will score, filter, and reweight the generated examples to focus selectively on the data that provides the highest learning value. Evaluating defenses without accounting for this gives a false sense of security.
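To make "score, filter, and reweight" concrete, here is a minimal sketch of what an adaptive student's data selection could look like. It is an illustration only, not the paper's procedure: the function name `select_and_reweight`, the `keep_fraction` and `temperature` parameters, and the idea that each example already carries a numeric "learning value" score are all assumptions made for this example.

```python
import math

def select_and_reweight(examples, scores, keep_fraction=0.5, temperature=1.0):
    """Keep the top-scoring examples and assign softmax weights to them.

    Hypothetical helper: `scores` stands in for whatever estimate of
    learning value the student uses; the thread does not specify one.
    Returns a list of (example, weight) pairs whose weights sum to 1.
    """
    if len(examples) != len(scores):
        raise ValueError("examples and scores must have the same length")
    if not examples:
        return []

    # Filter: keep only the highest-scoring fraction (at least one example).
    k = max(1, int(len(examples) * keep_fraction))
    ranked = sorted(zip(scores, examples), key=lambda pair: pair[0], reverse=True)[:k]

    # Reweight: softmax over the kept scores, shifted by the max for stability.
    top_score = ranked[0][0]
    raw = [math.exp((score - top_score) / temperature) for score, _ in ranked]
    total = sum(raw)
    return [(example, w / total) for (_, example), w in zip(ranked, raw)]


if __name__ == "__main__":
    traces = ["trace_a", "trace_b", "trace_c", "trace_d"]
    values = [0.1, 2.0, 1.0, -1.0]
    for trace, weight in select_and_reweight(traces, values):
        print(trace, round(weight, 3))
```

A passive student would instead train on every generated example with equal weight, which is the setting the thread says prior evaluations assume.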

Mahdi Haghifam@HaghifamMahdi

Evaluating against ADS (Savani et al.) reveals a massive "passive-adaptive gap." On GSM8K and MATH, we demonstrate that adaptive students recover ~50% more capability against state-of-the-art defenses than passive evaluations suggest. Savani et al.: https://arxiv.org/abs/2504.13146

Mahdi Haghifam@HaghifamMahdi

We formalize this dynamic as a minimax game between a utility-constrained teacher and an adaptive student. This framework provides a tractable way to obtain: 1) an adaptive evaluation strategy for the student, 2) a principled foundation for deriving new defense strategies for the teacher.
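One generic way to write such a game in symbols is sketched below. The notation is an assumption for illustration and is not taken from the paper: $q$ is the teacher's output distribution, $U(q) \ge \tau$ is its utility constraint, $\mathcal{A}$ is the set of student strategies, $a_0 \in \mathcal{A}$ is the passive student, and $V(q, a)$ is the capability the student recovers.

```latex
\min_{q \,:\, U(q) \ge \tau} \; \max_{a \in \mathcal{A}} \; V(q, a),
\qquad
\underbrace{\max_{a \in \mathcal{A}} V(q, a) \;-\; V(q, a_0)}_{\text{passive-adaptive gap}} \;\ge\; 0 .
```

Under this reading, a passive evaluation reports only $V(q, a_0)$, which can never exceed what the best adaptive student recovers.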

Mahdi Haghifam@HaghifamMahdi

Beyond evaluation, we show our framework can be used to obtain new defense strategies. We derive Product-of-Experts (PoE)—a forward-pass-only defense that combines the teacher with a proxy student. It suppresses outputs with a large teacher-over-student likelihood gap.
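As a rough illustration of how a forward-pass-only combination of a teacher and a proxy student might suppress tokens with a large teacher-over-student likelihood gap, the sketch below uses a geometric mixture of the two next-token distributions. The exact form, the function name `poe_next_token_probs`, and the strength parameter `lam` are assumptions for this example; the thread does not give the paper's formula.

```python
import math

def poe_next_token_probs(teacher_logprobs, student_logprobs, lam=0.5):
    """Combine teacher and proxy-student next-token log-probabilities.

    Assumed form (not from the paper): q(y) proportional to
    p_T(y)^(1 - lam) * p_S(y)^lam, i.e. the teacher log-prob minus
    lam times the teacher-over-student log-likelihood gap.
    lam = 0 returns the teacher distribution unchanged.
    """
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("both models must score the same vocabulary")

    # Penalize each token by its teacher-over-student gap.
    scores = [t - lam * (t - s) for t, s in zip(teacher_logprobs, student_logprobs)]

    # Normalize with a max-shifted softmax for numerical stability.
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    teacher = [math.log(p) for p in (0.7, 0.2, 0.1)]
    student = [math.log(p) for p in (0.1, 0.6, 0.3)]
    print([round(p, 3) for p in poe_next_token_probs(teacher, student, lam=0.5)])
```

Only log-probabilities from forward passes of the two models are needed here, which is consistent with the "forward-pass-only" description, though the real method may differ in its details.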

Mahdi Haghifam@HaghifamMahdi

Also check out the blog post (https://antidistillation.com/) by @ashertrockman and @yashsavani_ on the motivations and challenges of designing defenses against distillation attacks if you want to learn more about this line of work.
