
Blog Explores Defenses Against AI Model Distillation Attacks

Original post: Sanmi Koyejo #1056
Mahdi Haghifam@HaghifamMahdi

"Industrial-scale distillation" exposes proprietary models to capability extraction via APIs. Recent work asks how to prevent this at inference time while keeping utility. But evaluations assume a passive attacker. We show that against an adaptive attacker, the defense disappears. 🧵

3:17 PM · Jun 9, 2026 · 1.5K Views

Check out our new paper on understanding the distillation game between model providers (who want to hide a capability) and adaptive students.

@stai_research @StanfordAILab

Mahdi Haghifam@HaghifamMahdi

When a model exposes rich reasoning (like chain-of-thought), it risks leaking the exact steps needed to solve complex problems. Recent "antidistillation" methods intervene during generation to prevent this extraction, tweaking the outputs to hide the most useful training signals.

Mahdi Haghifam@HaghifamMahdi

Bottom line: antidistillation defenses must be:

1) evaluated against adaptive, not passive, students, 2) reliable beyond just accuracy, 3) computationally efficient

Check out the full paper and our codebase here:

https://arxiv.org/abs/2605.22737

With @ys_alh @rzshokri @sanmikoyejo

Mahdi Haghifam@HaghifamMahdi

PoE performs competitively against SOTA defenses like ADS but requires significantly less compute. On GSM8K, PoE incurs only a 1.6× overhead (compared to 2.9× for prior methods) while preserving higher-quality reasoning.

Mahdi Haghifam@HaghifamMahdi

But in practice, an attacker isn't passive. They are adaptive. They will score, filter, and reweight the generated examples to focus selectively on the data that provides the highest learning value. Evaluating defenses without accounting for this gives a false sense of security.
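
The score, filter, and reweight loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's procedure: the function name `adaptive_weights`, the `keep_fraction` and `temperature` parameters, and the idea that each teacher trace already carries a scalar "learning value" score are all assumptions made for the example.

```python
import numpy as np

def adaptive_weights(scores, keep_fraction=0.5, temperature=1.0):
    """Turn per-trace scores into training weights (hypothetical helper).

    scores: one scalar per teacher-generated trace; higher means the
            student expects to learn more from it (assumed to be given).
    Returns an array of weights that sums to 1, with zeros for the
    traces that were filtered out.
    """
    scores = np.asarray(scores, dtype=float)
    # Filter: keep only the top-scoring fraction of traces.
    n_keep = max(1, int(np.ceil(keep_fraction * len(scores))))
    keep_idx = np.argsort(scores)[::-1][:n_keep]
    # Reweight: softmax over the surviving scores.
    kept = scores[keep_idx] / temperature
    w = np.exp(kept - kept.max())
    weights = np.zeros_like(scores)
    weights[keep_idx] = w / w.sum()
    return weights

# Usage: four traces, keep the best half and weight them by score.
weights = adaptive_weights([0.1, 2.0, 1.0, -1.0], keep_fraction=0.5)
```

A passive student corresponds to uniform weights over every trace; the adaptive student concentrates its training signal on the traces it scores highest.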

Mahdi Haghifam@HaghifamMahdi

Evaluating against ADS (Savani et al.) reveals a massive "passive-adaptive gap." On GSM8K and MATH, we demonstrate that adaptive students recover ~50% more capability against state-of-the-art defenses than passive evaluations suggest. Savani et al.: https://arxiv.org/abs/2504.13146

Mahdi Haghifam@HaghifamMahdi

We formalize this dynamic as a minimax game between a utility-constrained teacher and an adaptive student. This framework gives a tractable route to two things: 1) an adaptive evaluation strategy for the student, and 2) a principled foundation for deriving new defense strategies for the teacher.
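
One generic way to write such a game down is shown below. The notation is assumed for illustration and is not taken from the paper: p_T is the teacher's undefended output distribution, q is the defended distribution it actually serves, U is a utility measure, ε is the utility loss the teacher tolerates, W is the set of student strategies (scoring, filtering, reweighting), and G(q, w) is the capability the student recovers.

```latex
\min_{q \,:\, U(q) \,\ge\, U(p_T) - \epsilon} \;\; \max_{w \in \mathcal{W}} \;\; G(q, w)
```

Read this way, a passive evaluation fixes a single w, while an adaptive evaluation takes the inner maximum over W, so it can only report equal or higher student gain.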

Mahdi Haghifam@HaghifamMahdi

Beyond evaluation, we show our framework can be used to obtain new defense strategies. We derive Product-of-Experts (PoE), a forward-pass-only defense that combines the teacher with a proxy student. It suppresses outputs with a large teacher-over-student likelihood gap.
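
To make the idea of combining a teacher with a proxy student concrete, here is a small token-level sketch. It is an assumption-laden illustration rather than the paper's exact rule: the function `poe_next_token_probs`, the mixing strength `lam`, and the specific choice of subtracting `lam` times the log-likelihood gap are all hypothetical. The only inputs are next-token logits from one forward pass of each model.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def poe_next_token_probs(teacher_logits, student_logits, lam=0.5):
    """Product-of-experts style sampling distribution (illustrative).

    Tokens the teacher rates much higher than the proxy student
    (a large teacher-over-student log-likelihood gap) are downweighted.
    Algebraically this equals softmax((1 - lam) * log p_T + lam * log p_S).
    """
    log_pt = log_softmax(teacher_logits)
    log_ps = log_softmax(student_logits)
    gap = log_pt - log_ps            # teacher-over-student gap per token
    combined = log_pt - lam * gap    # penalize large gaps
    return np.exp(log_softmax(combined))

# Usage: the teacher likes tokens 0 and 1 equally, but the proxy
# student finds token 1 unlikely, so token 1 is suppressed.
probs = poe_next_token_probs([2.0, 2.0, 0.0], [2.0, 0.0, 0.0], lam=0.5)
```

In this sketch, `lam=0` recovers the teacher's own distribution and `lam=1` recovers the proxy student's, so `lam` trades teacher utility against how much signal is hidden.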

Mahdi Haghifam@HaghifamMahdi

Also check out the blog post (https://antidistillation.com/) by @ashertrockman and @yashsavani_ on the motivations and challenges of designing defenses against distillation attacks if you want to learn more about this line of work.
