DenseNet co-creator Zhuang Liu finds that weaker teachers improve student LLMs, but overly strong teachers diminish performance gains
The study addresses knowledge distillation questions dating to 2015
@liuzhuang1234 @TaiMingLu These are probably worth referring to or looking at: [1] one-way or two-way codistillation https://arxiv.org/abs/1804.03235 [2] using smaller models to accelerate bigger ones https://arxiv.org/pdf/2410.18779
nice work!
This is something I have wanted to study ever since I read the original knowledge distillation paper in 2015. Finally we've done it - with @TaiMingLu we thoroughly study the necessity of distillation from a "stronger" teacher
Very interesting work from @liuzhuang1234's lab!
This reminds me of our earlier work with @violet_zct on understanding distillation for non-autoregressive machine translation. One takeaway was that the strongest teacher is not always the best!
