I'm afraid this is mostly an effect of "the curse of depth". @behrouz_ali, don't you think so? If later layers predominantly act as filters or feature suppressors rather than continuing complex circuits, it makes sense that they don't need a lot of capacity.
“Tapered Language Models”
Most LMs give every layer the same MLP width, but the paper argues this is probably wasteful.
Early layers seem to write more new information into the residual stream, while later layers mostly refine what is already there.
So instead of making the model bigger, they simply move MLP capacity forward. Early layers get wider FFNs, later layers get thinner FFNs, and the average width stays the same.
The best setup they found uses a smooth cosine taper from 1.5x normal MLP width in early layers to 0.5x in late layers, keeping total params and FLOPs fixed.
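A rough sketch of what such a schedule could look like, in case it helps. The exact formula is an assumption here (a half-cosine from 1.5x down to 0.5x over layer index), not necessarily the paper's parameterization, and the names `cosine_taper_multipliers` and `tapered_widths` are made up for illustration. Because the cosine term is symmetric around the middle layer, the multipliers average to 1.0, and since MLP parameters scale linearly with hidden width at a fixed model dimension, the total MLP budget stays roughly unchanged (up to rounding).

```python
import math


def cosine_taper_multipliers(num_layers, start=1.5, end=0.5):
    """Per-layer MLP width multipliers, going from `start` to `end`.

    Assumed form: a half-cosine over the layer index. This is an
    illustrative guess, not the paper's confirmed formula.
    """
    if num_layers == 1:
        return [(start + end) / 2]
    mid = (start + end) / 2   # average multiplier (1.0 for 1.5 -> 0.5)
    amp = (start - end) / 2   # half the range (0.5 for 1.5 -> 0.5)
    return [
        mid + amp * math.cos(math.pi * i / (num_layers - 1))
        for i in range(num_layers)
    ]


def tapered_widths(base_width, num_layers, start=1.5, end=0.5):
    """Integer MLP hidden widths per layer, rounded from the multipliers."""
    return [
        round(base_width * m)
        for m in cosine_taper_multipliers(num_layers, start, end)
    ]


# Hypothetical 12-layer model with a uniform baseline MLP width of 4096.
multipliers = cosine_taper_multipliers(12)
widths = tapered_widths(4096, 12)
print([f"{m:.2f}" for m in multipliers])
print(widths)
print("mean width:", sum(widths) / len(widths))
```

In practice you would probably also round each width to a hardware-friendly multiple (say 64 or 128), which can shift the total slightly away from the uniform baseline.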
This improves perplexity and downstream accuracy across Transformers, Gated Attention, Hope-attention, and Titans.

