Tech · 2h ago

Study finds compressing Chain-of-Thought traces during knowledge distillation degrades downstream student model accuracy

Story Overview

The work tests whether shortening a large teacher model's lengthy reasoning traces before distillation can deliver efficiency gains without hurting the smaller student's final performance. It finds that uncompressed traces yield the highest accuracy at every scale tested, even though compression cuts training tokens and speeds up runs.

Original post

Florian Brand @xeophon

wanted an argument that summarized traces are okay, but only raw CoT is really useful for distillation

thanks to @iScienceLuvr to write a paper about that, confirming the hypothesis 🐐

1:52 AM · Jun 27, 2026 · 3.2K Views

Accuracy Trade-off

Raw traces hold the accuracy lead

Across dozens of teacher-student runs, students trained on uncompressed traces outperformed those trained on both model-compressed and length-truncated alternatives, though the compressed students still reached up to 96 percent of that peak while using far fewer tokens.
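As a rough illustration of the trade-off described here, the Python sketch below applies length truncation to a toy trace and computes the two ratios involved: the share of training tokens saved and the share of raw-trace accuracy a compressed student retains. The function names, the whitespace tokenization, and every number are hypothetical placeholders chosen for illustration; none of them come from the paper.

```python
def truncate_trace(trace_tokens, max_tokens):
    """Length truncation: keep only the first max_tokens tokens of a trace."""
    return trace_tokens[:max_tokens]


def token_savings(raw_len, compressed_len):
    """Fraction of training tokens removed by compression."""
    return 1.0 - compressed_len / raw_len


def retained_accuracy(compressed_acc, raw_acc):
    """Compressed-student accuracy as a fraction of the raw-trace student's."""
    return compressed_acc / raw_acc


# Hypothetical toy trace, tokenized on whitespace for simplicity.
raw_trace = "step one ; step two ; step three ; answer".split()
short_trace = truncate_trace(raw_trace, 4)

# Hypothetical accuracies, used only to show the arithmetic.
raw_student_acc = 0.50
compressed_student_acc = 0.48

savings = token_savings(len(raw_trace), len(short_trace))
retained = retained_accuracy(compressed_student_acc, raw_student_acc)
print(f"tokens saved: {savings:.0%}, accuracy retained: {retained:.0%}")
```

A model-compressed variant would replace truncate_trace with a call to a summarizing model; the two ratios would be computed the same way.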

Open Question

No code or data release yet

The arXiv preprint details controlled ablations but leaves open whether the observed gap will hold once independent labs can rerun the grid on their own student setups.

Via arxiv.org

Posts from X

Most activity: 2.5K views · 7 bookmarks · 18 likes · 1 retweet · 3 replies

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

CoT summarization/hiding is a straightforwardly efficient gatekeeping mechanism.

2h

Florian Brand @xeophon

paper is https://arxiv.org/abs/2606.05988v1

2h

Florian Brand @xeophon

@teortaxesTex Hiding is the thing that’s really effective

You’d guess that Ant would’ve started hiding them a long time ago, but alas

2h

EternalTwilight @eternal_twil

@xeophon @iScienceLuvr On the other side, would be interesting to see compression of raw into shortest possible sentences vs compression into long ones; intuition tells me the difference would be negligible

2h

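Rush hour Notes @RushHourNotes

@teortaxesTex Engineers in the field have long had a hunch that knowledge distillation from summarized data has its limits. But now that it has been demonstrated with data like this, it will probably spark a debate over whether models from companies that don't disclose raw data are worth using.

2h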

Quinn · gate-85合作通道 @hamako320hlmo

@teortaxesTex This barrier really does filter out a lot of freeloaders

2h