Study finds pruning larger LLMs outperforms training smaller models from scratch, even with extra training tokens

The performance advantages extend beyond simple parameter initialization.

Original post

Yufeng (Felix) Xu@Zephyr271828

You want a strong small LLM. Would you start small — or inherit from something bigger?

📄 New paper: Small LLMs: Pruning vs. Training from Scratch

We find that pruning is more than a better initialization: simply giving randomly initialized LLMs more training tokens is often not enough to catch up.

12:00 PM · Jun 23, 2026 · 19.1K Views

Related links

Rethinking the Value of Network Pruning

arxiv.org

Posts from X

Zhuang Liu@liuzhuang1234

Years ago, in 2018, we pushed back on the hype around network pruning in the ConvNet era.

The question is only bigger now that everyone wants small LLMs.

Yufeng (Felix) Xu@Zephyr271828

🔍 Pruning outperforms training from scratch when the training token budget is equal.

We pretrain a Llama3-8B model on 200B tokens, prune it, and train the small LLM on 50B tokens.

Compared with a small LLM with the same architecture and 50B training tokens, pruning-initialized models always perform better.

Yufeng (Felix) Xu@Zephyr271828

We open-source our implementations of all the pruning methods, along with GPU and TPU training and evaluation code.

Thanks @TaimingLu, Kunjun Li, @JiachenAI, @_mingjiesun, and @liuzhuang1234 for the support!

arXiv: http://arxiv.org/abs/2606.14150

Code: http://github.com/zlab-princeton/pruning-vs-scratch

Yufeng (Felix) Xu@Zephyr271828

🔬 We apply 6 pruning methods across 3 different granularities: depth (Minitron-depth), width (Minitron-width, FLAP, Sheared Llama), and sparse pruning (SparseGPT, Wanda).

We compare the performance of initializing by pruning vs training from scratch with equal training tokens.
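
To make the three granularities concrete, here is a minimal toy sketch in plain Python. The function names and the selection criteria (keep the first layers, row L2 norm, weight magnitude) are assumptions chosen for illustration only; they are not the criteria used by Minitron, FLAP, Sheared Llama, SparseGPT, or Wanda, and this is not the authors' code.

```python
# Toy illustration of depth, width, and sparse (unstructured) pruning.
# All criteria here are generic stand-ins (assumptions), not the methods
# named in the post.
import math

Matrix = list[list[float]]


def depth_prune(layers: list[Matrix], keep: int) -> list[Matrix]:
    """Depth pruning: drop whole layers (here, simply keep the first `keep`)."""
    return layers[:keep]


def width_prune(weight: Matrix, keep: int) -> Matrix:
    """Width pruning: drop whole rows (output channels) with the smallest L2 norm."""
    norms = [math.sqrt(sum(w * w for w in row)) for row in weight]
    # Pick the `keep` largest-norm rows, then restore their original order.
    top = sorted(range(len(weight)), key=lambda i: norms[i], reverse=True)[:keep]
    return [weight[i] for i in sorted(top)]


def sparse_prune(weight: Matrix, sparsity: float) -> Matrix:
    """Sparse pruning: zero the smallest-magnitude entries; the shape is unchanged."""
    flat = sorted(abs(w) for row in weight for w in row)
    n_drop = int(len(flat) * sparsity)
    if n_drop == 0:
        return [row[:] for row in weight]
    threshold = flat[n_drop - 1]  # ties at the threshold are also zeroed
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weight]


W = [[0.1, -2.0], [3.0, 0.5], [0.2, 0.3]]
print(len(depth_prune([W, W, W, W], keep=2)))  # fewer layers
print(width_prune(W, keep=2))                  # fewer rows per layer
print(sparse_prune(W, sparsity=0.5))           # same shape, half the entries zeroed
```

Depth and width pruning shrink the architecture itself, while sparse pruning keeps the shape and only zeroes individual weights, which is why the thread treats it as the finest granularity.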

Zhuang Liu@liuzhuang1234

The paper from back then: "Rethinking the Value of Network Pruning" (ICLR 2019)

http://arxiv.org/abs/1810.05270

Same question, now for LLMs.

Yufeng (Felix) Xu@Zephyr271828

Recommendations:

💡 With a large pretrained model in hand and a limited training token budget, pruning is better than training from scratch.

💡 When the training budget is not limited, training from scratch can be competitive for coarser pruning, so a large pretrained parent is not always necessary.

Yufeng (Felix) Xu@Zephyr271828

🔍 As the pruning ratio increases, the advantage of depth pruning diminishes and eventually vanishes, while the advantage of width pruning is stable.

Depth pruning is sensitive to the pruning ratio, while width pruning is robust.

Yufeng (Felix) Xu@Zephyr271828

🔍 Sometimes the advantage of pruning cannot be overcome with more training tokens.

When models are trained from scratch with the whole token budget used in the pretraining-pruning-retraining pipeline, they still underperform pruning methods with finer granularities.

This suggests that pruning is more than data transfer; it transfers the advantages of learning at a larger scale.
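
The two comparison settings in this thread can be sketched as simple token accounting. This is a minimal sketch under one loud assumption: the "whole token budget" is taken to be the 200B parent pretraining tokens plus the 50B retraining tokens quoted earlier in the thread. The posts do not say whether pruning itself consumes additional tokens, and the names below are hypothetical.

```python
# Token accounting for the from-scratch baseline in the two settings.
# Assumption: whole pipeline budget = parent pretraining + retraining tokens.
PARENT_PRETRAIN_TOKENS = 200_000_000_000  # Llama3-8B parent, from the thread
RETRAIN_TOKENS = 50_000_000_000           # training after pruning, from the thread


def scratch_budget(setting: str) -> int:
    """Tokens given to the from-scratch small model in each setting."""
    if setting == "equal_training":
        # Same number of tokens as the pruned model's retraining phase.
        return RETRAIN_TOKENS
    if setting == "full_pipeline":
        # Everything the pretrain-prune-retrain pipeline consumed (assumed sum).
        return PARENT_PRETRAIN_TOKENS + RETRAIN_TOKENS
    raise ValueError(f"unknown setting: {setting}")


for name in ("equal_training", "full_pipeline"):
    print(name, scratch_budget(name) // 10**9, "B tokens")
```

Under this reading, the second setting gives the scratch model five times the tokens of the first, which is the sense in which extra tokens alone are reported as not always enough.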

Yufeng (Felix) Xu@Zephyr271828

🔍 Across the structured pruning methods we examined, the advantage of pruning over training from scratch diminishes as the training token budget increases, but the gap does not close.

Amil Dravid@_AmilDravid

Thanks for sharing this paper - very useful for the community! Curious if you’ve tried pruning during training of the larger model. I ran some experiments asking whether this can beat training the small model from scratch under the same end-to-end compute/token budget. I couldn’t get it to beat the naive scratch baseline, so I’d be curious to hear your thoughts.

ueaj@_ueaj

@Zephyr271828 How does it compare to distilling?

Kirito (e/acc) 🏴‍☠️@bronzeagepapi

@Zephyr271828 @0xSero

Irving@ieqr_

@Zephyr271828 Nice to see such a great paper!!! The results match what intuition and earlier empirical results point to. It reminds me of the ModernBERT recipe, if I'm not mistaken.

cordivai | Machine Learning & AI@cordivai

@Zephyr271828 Good framing for LLM research work. The practical part is not just trying a stronger model, but logging baselines, data splits, task-specific metrics, and failure cases so the result is reproducible.

combin8@combin8or

@Zephyr271828 @TaimingLu @JiachenAI @_mingjiesun @liuzhuang1234 Really interesting work, though your repo link 404s currently

Thomas Gatliff@TGatliff

@Zephyr271828 @TaimingLu @JiachenAI @_mingjiesun @liuzhuang1234 Someone really needs to prune GLM 5.2 so that it only focuses on programming-related tasks and gets rid of everything else. I’m really growing tired of LLMs pretending to be know-everything models.

Matias@Bolmercl

@Zephyr271828 @0xSero

That Guy^@IsaacLewisxhha

@Zephyr271828 Try this. Check my page , I shared a great paper I think will help.

Gabe Astrobot@robodadg

@liuzhuang1234 2018 was ConvNet compression. LLM pruning inherits structure that scratch has to rediscover. equal-cost gap confirms it.