4h ago

DenseNet co-creator Zhuang Liu finds that weaker teachers improve student LLMs, but overly strong teachers diminish performance gains

The study addresses knowledge distillation questions dating to 2015

Original post

This is something I have wanted to study ever since I read the original knowledge distillation paper in 2015. We've finally done it: with @TaiMingLu, we thoroughly study the necessity of distillation from a "stronger" teacher

12:14 PM · May 26, 2026

@liuzhuang1234 @TaiMingLu These are probably worth referring to or looking at: [1] one-way or two-way codistillation: https://arxiv.org/abs/1804.03235 [2] https://arxiv.org/pdf/2410.18779, on using smaller models to accelerate bigger ones

nice work!

8:43 PM · May 26, 2026 · 863 Views

Very interesting work from @liuzhuang1234's lab!

This reminds me of our earlier work with @violet_zct on understanding distillation for non-autoregressive machine translation. One takeaway was that the strongest teacher is not always the best!

arxiv.org
Understanding Knowledge Distillation in Non-autoregressive Machine...
Non-autoregressive machine translation (NAT) systems predict a sequence of output tokens in parallel, achieving substantial improvements in generation speed compared to autoregressive models....
8:01 PM · May 26, 2026 · 3.5K Views