wanted an argument that summarized traces are okay, but only raw CoT is really useful for distillation
thanks to @iScienceLuvr for writing a paper about that, confirming the hypothesis 🐐
The work tests whether shortening lengthy reasoning traces from large teacher models before distillation can deliver efficiency gains without hurting the smaller student's final performance. It finds that uncompressed traces yield the highest accuracy at every scale tested, even though compression sharply reduces training tokens and speeds up runs.
Across dozens of teacher-student runs, students trained on uncompressed traces outperformed both the model-compressed and length-truncated alternatives, though the compressed students still reached up to 96 percent of the uncompressed students' performance while using far fewer tokens.
The arXiv preprint details controlled ablations but leaves open whether the observed gap will hold once independent labs can rerun the grid on their own student setups.
CoT summarization/hiding is a straightforwardly efficient gatekeeping mechanism.
paper is https://arxiv.org/abs/2606.05988v1
@teortaxesTex Hiding is the thing that’s really effective
You’d guess that Ant would’ve started hiding them a long time ago, but alas

@xeophon @iScienceLuvr On the other hand, it would be interesting to see compression of raw traces into the shortest possible sentences vs. compression into long ones; intuition tells me the difference would be negligible

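@teortaxesTex Engineers in the field already had a hunch that summarized data puts a limit on knowledge distillation, right? But now that it's been demonstrated with data like this, it will probably spark a debate over whether models from companies that don't disclose raw data are worth using.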

@teortaxesTex This barrier really does filter out a lot of freeloaders