4h ago

DenseNet co-creator Zhuang Liu finds that weaker teachers improve student LLMs, but overly strong teachers diminish performance gains

The study addresses knowledge distillation questions dating to 2015

Original post

This is something I have wanted to study ever since I read the original knowledge distillation paper in 2015. Finally we've done it: with @TaiMingLu, we thoroughly study the necessity of distillation from a "stronger" teacher.

12:14 PM · May 26, 2026

rohan anil@_AROHAN_

@liuzhuang1234 @TaiMingLu These are probably worth referring to/looking at: [1] one-way or two-way codistillation https://arxiv.org/abs/1804.03235 [2] https://arxiv.org/pdf/2410.18779 using smaller models to accelerate bigger ones

nice work!

Zhuang Liu@liuzhuang1234

7:14 PM · May 26, 2026 · 12.2K Views

8:43 PM · May 26, 2026 · 863 Views

QUOTE POST

Jiatao Gu@THOMA_GU

Very interesting work from @liuzhuang1234's lab!

This reminds me of our earlier work with @violet_zct on understanding distillation for non-autoregressive machine translation. One takeaway was that the strongest teacher is not always the best!

arxiv.org

Understanding Knowledge Distillation in Non-autoregressive Machine...

Non-autoregressive machine translation (NAT) systems predict a sequence of output tokens in parallel, achieving substantial improvements in generation speed compared to autoregressive models....

8:01 PM · May 26, 2026 · 3.5K Views