
Teortaxes: top-k distillation matches policy-gradient with same-origin teachers at lower cost, but collapses when the teacher is a different-origin model

Story Overview

Teortaxes relays a finding that top-k distillation performs on par with policy-gradient (PG) methods when the teacher shares the student's origin, in comparisons on Qwen3-scale models, while lowering infrastructure load because less teacher signal has to be transmitted. Once the teacher is a different model, even one from the same family, PG struggles and top-k collapses entirely. Teortaxes suggests this fragility is why V4 kept full-vocabulary distillation.
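To make the cost difference concrete, here is a minimal sketch of a per-token distillation loss in its full-vocabulary and top-k forms. It is an illustration under stated assumptions, not the method from the work Teortaxes refers to: the function names are hypothetical, the loss is taken to be KL(teacher || student), and the top-k variant renormalises both distributions over the teacher's k most likely tokens, which is only one of several possible truncation schemes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def full_vocab_kl(teacher_logits, student_logits):
    """KL(teacher || student) over the entire vocabulary."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0)

def top_k_kl(teacher_logits, student_logits, k):
    """KL restricted to the teacher's k most likely tokens.

    Assumption: both distributions are renormalised over that subset.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    p_mass = sum(p[i] for i in top)
    q_mass = sum(q[i] for i in top)
    return sum((p[i] / p_mass)
               * (math.log(p[i] / p_mass) - math.log(q[i] / q_mass))
               for i in top)

def payload_values_per_token(vocab_size, k=None):
    """Numbers sent per token: V probabilities for full vocabulary,
    or k probabilities plus k token ids for top-k."""
    return vocab_size if k is None else 2 * k

# Toy 5-token vocabulary; the logits are made up for illustration.
teacher = [2.0, 1.0, 0.5, -1.0, -3.0]
student = [1.5, 1.2, 0.1, -0.5, -2.0]
print(full_vocab_kl(teacher, student))
print(top_k_kl(teacher, student, k=2))
```

With k equal to the vocabulary size the two losses coincide, so top-k is a strict truncation of the full-vocabulary signal. The saving comes from the payload: under this sketch's accounting, a hypothetical 150,000-token vocabulary needs 150,000 values per token for the full distribution but only 128 for k = 64. The truncation also discards whatever the teacher says about tokens outside its own top k.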


Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

They find that PG and top-k distillations are comparable so long as you use same-origin teachers, and top-k is cheaper. But if you try to get cute with a different model, even same family… PG struggles, top-k totally collapses. I think this is why V4 went with full-vocabulary.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Chinese labs are iterating on OPD quickly

3:17 AM · Jun 30, 2026 · 227 Views

Open Question

Cross-origin teacher mismatches limit cheap approximations

When the teacher and student come from different origins, top-k distillation loses stability faster than policy-gradient baselines, so practitioners must weigh whether the compute savings justify the added risk of training collapse. No public benchmarks yet quantify the exact performance gap or the best values of k for cross-origin setups.

Developer Impact

V4 may have retained full-vocabulary distillation for stability reasons

The same-origin success of top-k does not extend reliably to heterogeneous teacher mixes, which may explain why V4 avoided the lighter method despite its lower per-step cost. Multi-teacher pipelines now face a concrete trade-off between efficiency and robustness that current stabilizers have not fully resolved.


Posts from X


Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Might be of interest to @Grad62304977 @rawsh0 @stochasticchasm and others


Grad@Grad62304977

@teortaxesTex @rawsh0 @stochasticchasm I’d say this is still bullish on mix-RL given MOPD needs the data and compute to train the teacher first, mostly looks like mix rl and MOPD would take the same compute Although u also get the potential benefit of cross domain generalisation with mix RL esp with larger models
