Zhaofeng Wu introduces ><former, a variable-width transformer that lowers loss while cutting compute and KV cache size

The architecture varies width across layers: wider at the two ends and narrower in the middle

Original post

Zhaofeng Wu@zhaofeng_wu

Introducing ><former

Most transformers are rectangles◻️: every layer has the same width

But is that optimal?🤔

We propose variable-width transformers that have different widths across layers, improving loss while cutting compute & KV cache size 🧵

3:13 PM · Jun 18, 2026 · 718 Views

Sentiment

Commenters are positive, highlighting the enjoyable collaboration behind variable-width transformers, which cut compute and KV cache size while improving loss.

Positive: 100.0% · Negative: 0.0% (2 comments with sentiment)

Posts from X

Han Guo@HanGuo97

A very intuitive idea that works!

Zhaofeng Wu@zhaofeng_wu

It has been super fun working with @osieberling @tanshawn @rpanda89 Yury Polyanskiy and Yoon Kim!

📄 Paper: https://arxiv.org/abs/2606.18246

Zhaofeng Wu@zhaofeng_wu

@sanxing_chen @linluqiu greater-than-less-than-former

Zhaofeng Wu@zhaofeng_wu

🔬 Across 200M → 2B LMs and a 3B/1B MoE, ><former (wider on the two ends and narrower in the middle ⌛️) achieves lower loss than parameter-matched constant-width baselines, with lower FLOPs and average layer width (which determines KV cache size & I/O cost for activations)
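As a rough illustration of the shape described above, the sketch below builds a hypothetical hourglass width schedule and compares a per-token KV cache estimate against a constant-width stack. The function names, the linear taper, the layer count, and the widths are all assumptions made for illustration, not details from the paper. The toy baseline is also not parameter-matched, so the ratio only shows how average layer width drives cache size.

```python
def hourglass_widths(num_layers, end_width, mid_width):
    """Hypothetical schedule: taper linearly from end_width to mid_width and back."""
    if num_layers < 2:
        return [end_width] * num_layers
    center = (num_layers - 1) / 2
    widths = []
    for layer in range(num_layers):
        # frac is 1.0 at the first and last layer and 0.0 at the exact center
        frac = abs(layer - center) / center
        widths.append(round(mid_width + frac * (end_width - mid_width)))
    return widths


def kv_cache_bytes_per_token(widths, bytes_per_value=2):
    """Assumes each layer caches one key and one value vector of size `width`."""
    return sum(2 * w * bytes_per_value for w in widths)


# Illustrative numbers only (not taken from the paper)
widths = hourglass_widths(num_layers=5, end_width=1024, mid_width=512)
constant = [1024] * 5

average_width = sum(widths) / len(widths)
cache_ratio = kv_cache_bytes_per_token(widths) / kv_cache_bytes_per_token(constant)

print(widths)
print(average_width)
print(cache_ratio)
```

Under these assumptions the cache estimate scales with the sum of layer widths, so narrowing the middle layers lowers it directly.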

Zhaofeng Wu@zhaofeng_wu

Scaling curves suggest this trend may persist, if not widen, as we scale up further ⤴️

According to these scaling curves, ><former can reach the 2B baseline’s loss with ~78% of the pre-training FLOPs
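To put the ~78% figure in other terms, here is a small arithmetic sketch. The variable names are hypothetical, and 0.78 is the approximate value quoted in the post, so the derived numbers are approximate as well.

```python
# Approximate fraction of the 2B baseline's pre-training FLOPs quoted in the post
flops_fraction = 0.78

# Share of compute saved at equal loss
flops_saved = 1 - flops_fraction

# How much more compute the baseline needs relative to the variable-width model
baseline_multiplier = 1 / flops_fraction

print(f"saved: {flops_saved:.0%}, baseline needs {baseline_multiplier:.2f}x the FLOPs")
```

In other words, the claim amounts to roughly a fifth less pre-training compute for the same loss.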

Zhaofeng Wu@zhaofeng_wu

🚀 Takeaway: transformer scaling is not just about depth, width, and data. The *shape* of widths across layers is an under-explored design axis, and more studies here seem worth doing

And yes "><former" is probably a nightmare for search engine optimization but well 🤷

Sanxing Chen@sanxing_chen

@zhaofeng_wu @linluqiu How is it pronounced?

Junlin Han@han_junlin

@zhaofeng_wu @yzhang_cs ><

Aaron@aaronbatilo

@zhaofeng_wu How do you pronounce this

Bleys Goodson@bleysg

@zhaofeng_wu Too late to re-coin them Slenderformers?

handongxue@likev

@zhaofeng_wu @osieberling u-net transformers

Avi Trost@atrost3122

@zhaofeng_wu @sanxing_chen @linluqiu 🙌

Esteban@estebarb

@zhaofeng_wu Did you try a "rhombus"? Also, too late for cat-whiskers-former?

AiDevCraft@AiDevCraft

The middle-narrow shape echoes what U-Nets and autoencoders converged on — high width at the edges where you handle raw tokens and unembedding, narrow in the middle where representations are most abstract. Curious whether the optimal pinch-point shifts with vocab size or stays anchored to relative layer depth.
