/Tech · 5h ago

Tapered MLP Widths Improve Language Model Efficiency and Accuracy

Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex · #501 in Tech

I'm afraid this is mostly an effect of "the curse of depth". @behrouz_ali, don't you think so? If later layers predominantly serve as filters or feature suppressors rather than continuing complex circuits, it makes sense that they don't need a lot of capacity.

alphaXiv@askalphaxiv

“Tapered Language Models”

Most LMs give every layer the same MLP width, but the paper shows this is probably wasteful.

Early layers seem to write more new information into the residual stream, while later layers mostly refine what is already there.

So instead of making the model bigger, they simply move MLP capacity forward. Early layers get wider FFNs, later layers get thinner FFNs, and the average width stays the same.

The best setup they found uses a smooth cosine taper from 1.5x normal MLP width in early layers to 0.5x in late layers, keeping total params and FLOPs fixed.
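As a rough illustration of the schedule described above, here is a minimal sketch of a cosine taper over per-layer MLP widths. The post does not give the paper's exact formula, so the function `tapered_mlp_widths`, its parameters, and the specific interpolation (a half cosine wave running from the `start` multiplier to the `end` multiplier) are assumptions made for illustration only.

```python
import math

def tapered_mlp_widths(num_layers, base_width, start=1.5, end=0.5):
    """Hypothetical helper: per-layer MLP hidden widths on a cosine taper.

    Assumed schedule (not taken from the paper): the width multiplier follows
    half a cosine wave from `start` at the first layer to `end` at the last.
    """
    if num_layers < 2:
        return [base_width] * num_layers
    mid = (start + end) / 2   # average multiplier (1.0 for 1.5 -> 0.5)
    amp = (start - end) / 2   # half the spread between first and last layer
    widths = []
    for layer in range(num_layers):
        t = layer / (num_layers - 1)              # 0.0 at first layer, 1.0 at last
        mult = mid + amp * math.cos(math.pi * t)  # cosine-shaped multiplier
        widths.append(round(base_width * mult))
    return widths

# Example values (12 layers, base width 4096) are arbitrary choices.
widths = tapered_mlp_widths(num_layers=12, base_width=4096)
print(widths)
print(sum(widths) / len(widths))  # stays at roughly the base width
```

Because the cosine multipliers are symmetric around 1.0, the average width stays at the base width up to rounding. Since a standard MLP block's parameter count scales linearly with its hidden width, keeping the average width fixed also keeps the total MLP parameter budget roughly fixed under this sketch.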

This improves perplexity and downstream accuracy across Transformers, Gated Attention, Hope-attention, and Titans.

8:44 AM · Jun 25, 2026 · 3.3K Views

Sentiment

Users find the perspective on tapered MLP widths for improving language model efficiency and accuracy interesting because it opens up new ideas and approaches.

Pos: 100.0% · Neg: 0.0%

1 comment with sentiment.

Posts from X

Most Activity

Views: 989 · Likes: 7

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Reminds me of the Qwen 3.5 trick of truncating the model to maximize value per layer. If you cut off the tapered part of this cosine stack, you get only the rich early layers…

DeepReinforce@deep_reinforce

@teortaxesTex @behrouz_ali 🫡🫡DMed

5h · 8

黑芝麻小小85@contracostatp

@teortaxesTex @behrouz_ali The "curse of depth" perspective is really interesting; it opens up new ways of thinking about this.

5h · 2