><
Introducing ><former
Most transformers are rectangles◻️: every layer has the same width
But is that optimal?🤔
We propose variable-width transformers that have different widths across layers, improving loss while cutting compute & KV cache size 🧵
The architecture lets the width vary from layer to layer instead of fixing one width for the whole stack
A very intuitive idea that works!

It has been super fun working with @osieberling @tanshawn @rpanda89 Yury Polyanskiy and Yoon Kim!
📄 Paper: https://arxiv.org/abs/2606.18246

@sanxing_chen @linluqiu greater-than-less-than-former

🔬 Across 200M → 2B LMs and a 3B/1B MoE, ><former (wider on the two ends and narrower in the middle ⌛️) achieves lower loss than parameter-matched constant-width baselines, with lower FLOPs and average layer width (which determines KV cache size & I/O cost for activations)
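To make the "wider on the two ends, narrower in the middle" shape concrete, here is a minimal Python sketch. It is not the paper's method: the linear taper, the function names, the 12·w² per-layer parameter rule of thumb, and the assumption that each layer caches keys and values at its own width are all illustrative choices, and the sketch does not model how layers of different widths are connected. It only shows why an hourglass profile can have a lower average width (and so a smaller KV cache under these assumptions) than a constant-width stack with the same parameter count.

```python
import math


def hourglass_widths(num_layers, edge_width, mid_width, multiple=64):
    """Hypothetical schedule: taper linearly from edge_width at both ends
    to mid_width at the center, rounding each width to a multiple."""
    if num_layers < 2:
        raise ValueError("need at least two layers")
    center = (num_layers - 1) / 2
    widths = []
    for i in range(num_layers):
        t = abs(i - center) / center  # 1.0 at the ends, near 0.0 in the middle
        w = mid_width + t * (edge_width - mid_width)
        widths.append(int(round(w / multiple)) * multiple)
    return widths


def approx_layer_params(width):
    """Rule-of-thumb count (an assumption): 4*w^2 for attention
    plus 8*w^2 for an MLP with 4x expansion."""
    return 12 * width * width


def matched_constant_width(widths):
    """Constant width whose stack has the same approximate parameter count."""
    total = sum(approx_layer_params(w) for w in widths)
    return math.sqrt(total / (12 * len(widths)))


def kv_cache_bytes(widths, seq_len, bytes_per_value=2):
    """Assumes every layer stores one key and one value vector of its own
    width per token (no grouped-query attention or other compression)."""
    return sum(2 * seq_len * w * bytes_per_value for w in widths)


if __name__ == "__main__":
    widths = hourglass_widths(num_layers=12, edge_width=1024, mid_width=512)
    const_w = matched_constant_width(widths)
    print("hourglass widths:", widths)
    print("average width:", sum(widths) / len(widths))
    print("parameter-matched constant width:", round(const_w, 1))
    print("KV cache bytes at 4096 tokens:", kv_cache_bytes(widths, 4096))
```

Under the 12·w² assumption, the parameter-matched constant width is the root-mean-square of the layer widths, which is never smaller than their plain average. That inequality is the reason any non-constant profile gets a lower average width at equal parameters in this simplified accounting; the paper's own parameter and FLOP accounting may differ.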

Scaling curves suggest this trend may persist, if not widen, as we scale up further ⤴️ According to these scaling curves, ><former can reach the 2B baseline’s loss with ~78% of the pre-training FLOPs
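For readers wondering how a "same loss with a fraction of the FLOPs" figure can be read off a pair of scaling curves, here is a rough Python sketch. It assumes each curve is a pure power law L(C) = a · C^(-b), which is a simplification (fitted scaling laws often include an irreducible-loss term), and the thread does not say which functional form the authors fit. The coefficients below are placeholders chosen only to make the script run; they are not values from the paper.

```python
def loss_at(compute, a, b):
    """Loss predicted by an assumed power law L(C) = a * C**(-b)."""
    return a * compute ** (-b)


def compute_to_reach(loss, a, b):
    """Invert the assumed power law: the compute C at which L(C) equals loss."""
    return (a / loss) ** (1.0 / b)


def flops_ratio(base_fit, new_fit, base_compute):
    """Fraction of the baseline's compute that the new model needs to match
    the baseline's loss at base_compute, given (a, b) fits for both curves."""
    a0, b0 = base_fit
    a1, b1 = new_fit
    target_loss = loss_at(base_compute, a0, b0)
    return compute_to_reach(target_loss, a1, b1) / base_compute


if __name__ == "__main__":
    # Placeholder fits, NOT taken from the paper.
    baseline_fit = (10.0, 0.05)
    variable_width_fit = (9.9, 0.05)
    ratio = flops_ratio(baseline_fit, variable_width_fit, base_compute=1e20)
    print(f"compute needed relative to baseline: {ratio:.2f}")
```

A ratio below 1 means the new curve reaches the baseline's loss with less compute. With equal exponents the ratio is the same at every scale, while a steeper exponent for the new model would make the ratio shrink as compute grows, which is one way the gap could widen with scale.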

🚀 Takeaway: transformer scaling is not just about depth, width, and data. The *shape* of widths across layers is an under-explored design axis, and more studies here seem worth doing
And yes "><former" is probably a nightmare for search engine optimization but well 🤷

@zhaofeng_wu @linluqiu How is it pronounced?

@zhaofeng_wu @yzhang_cs ><

@zhaofeng_wu How do you pronounce this

@zhaofeng_wu Too late to re-coin them as Slenderformers?

@zhaofeng_wu @osieberling u-net transformers

@zhaofeng_wu @sanxing_chen @linluqiu 🙌

@zhaofeng_wu Did you try a "rhombus"? Also, too late for cat-whiskers-former?

The middle-narrow shape echoes what U-Nets and autoencoders converged on — high width at the edges where you handle raw tokens and unembedding, narrow in the middle where representations are most abstract. Curious whether the optimal pinch-point shifts with vocab size or stays anchored to relative layer depth.