Tech · 1d ago

Research finds merging Transformer projection layers cuts KV cache memory by 50% with minor perplexity loss

Pairing this method with MQA cuts memory by 96.9%; pairing it with GQA cuts memory by 87.5%.

Original post
Rohan Paul @rohanpaul_ai · #1102 in Tech

Interesting, this paper shows that Transformers may not need separate key and value projections to work well.

This paper's design cut the KV cache by 50% in language modeling with only 3.1% higher perplexity, meaning inference memory fell sharply while prediction quality stayed close.

A normal attention layer computes a Query to ask what each token needs, a Key to label what each token offers, and a Value to carry the information sent back.

Here, the surprising result is that Key and Value can often share the same learned map, because the model can use one representation both as an address and as the content being retrieved.

The best variant, Q-K=V, kept Query separate, so attention still had direction: one token can ask a different token for information instead of every relation becoming mirror-like.

When stacked with GQA and MQA, the same idea reached 87.5% and 96.9% cache cuts, because it reduces projection storage while those methods reduce stored heads.
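The stacked percentages are consistent with multiplying two independent savings: tying K and V halves the tensors stored per head, while GQA and MQA reduce the number of stored heads. The sketch below is only that arithmetic. The head counts are assumptions (16 attention heads, and a GQA setup with 4 KV heads); the post does not state them, and they were picked because they reproduce the quoted figures. `kv_cache_reduction` is a hypothetical helper name.

```python
def kv_cache_reduction(num_heads: int, num_kv_heads: int, tied_kv: bool) -> float:
    """Fraction of KV cache saved relative to standard multi-head attention.

    The baseline stores two tensors (K and V) for each of num_heads heads.
    """
    tensors_per_head = 1 if tied_kv else 2   # tied K=V stores a single tensor
    baseline = 2 * num_heads
    stored = tensors_per_head * num_kv_heads
    return 1.0 - stored / baseline


# ASSUMPTION: 16 heads and 4 GQA groups; not stated in the post.
NUM_HEADS = 16
configs = [
    ("Q-K=V alone", NUM_HEADS, True),
    ("Q-K=V + GQA (4 KV heads)", 4, True),
    ("Q-K=V + MQA (1 KV head)", 1, True),
]
for name, kv_heads, tied in configs:
    print(f"{name}: {kv_cache_reduction(NUM_HEADS, kv_heads, tied):.3%}")
```

Under these assumptions the three fractions are 0.5, 0.875, and 0.96875, the last of which the post rounds to 96.9%.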

The weak variant is Q=K-V, because tying Query and Key makes attention too symmetric for causal language, and it gives no KV-cache savings.
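A small numerical sketch of the symmetry point, assuming a single head, random weights, and no positional encoding (all names here are illustrative, not taken from the paper's code). With Query and Key tied, the pre-softmax logits form a symmetric matrix, so token i's affinity for token j always equals j's affinity for i. With only Key and Value tied, the logits stay asymmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.standard_normal((seq_len, d_model))      # token representations
w_q = rng.standard_normal((d_model, d_model))    # separate query projection
w_kv = rng.standard_normal((d_model, d_model))   # shared projection


def raw_scores(q, k):
    """Attention logits before masking and softmax."""
    return q @ k.T / np.sqrt(q.shape[-1])


# Q-K=V: the query has its own map; key and value share one.
scores_kv_tied = raw_scores(x @ w_q, x @ w_kv)

# Q=K-V: query and key share one map (value would be separate).
scores_qk_tied = raw_scores(x @ w_kv, x @ w_kv)

kv_tied_is_symmetric = np.allclose(scores_kv_tied, scores_kv_tied.T)
qk_tied_is_symmetric = np.allclose(scores_qk_tied, scores_qk_tied.T)
print(kv_tied_is_symmetric, qk_tied_is_symmetric)
```

The causal mask and softmax are applied after this step, so the final attention weights are not literally symmetric, but the underlying logits in the Q=K case carry no notion of direction.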

----

Link – arxiv.org/abs/2606.04032v2

Title: "Do Transformers Need Three Projections? Systematic Study of QKV Variants"

4:39 AM · Jun 9, 2026 · 3.9K Views
Sentiment

Positive users hail shared key-value or merged QKV projections in transformers for halving KV cache and memory use with minimal trade-offs, calling it a major practical win for scalable long-context and edge-device inference.

Pos 100.0% · Neg 0.0% · 5 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 20.2K · Bookmarks 263 · Likes 279 · Retweets 38 · Replies 8
Grigory Sapunov @che_shr_cat

1/ We have spent years optimizing KV cache via head-sharing (GQA/MQA), but we ignored a fundamental assumption: why do Transformers need three separate Q, K, and V projections in the first place?

Turns out, they don't. Merging them unlocks massive memory savings. 🧵

1d · Views 20.2K · Likes 279 · Bookmarks 263
Cody Blakeney @code_star

This paper looks very interesting, and it’s a really cool idea.

I gotta say though these results make plain MQA look really compelling.

15h · Views 1.4K · Likes 12 · Bookmarks 7
Grigory Sapunov @che_shr_cat

10/ Read the full paper and code here:

Paper: https://arxiv.org/abs/2606.04032
Code: https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections

Read my complete technical breakdown: https://arxiviq.substack.com/p/do-transformers-need-three-projections

What are your thoughts on collapsing the QKV space?

1d · Views 445 · Likes 11 · Bookmarks 5
Grigory Sapunov @che_shr_cat

6/ Does this break the model? Hardly.

At 1.2B scale, the Q-K=V model loses only 0.41% average downstream accuracy (HellaSwag, WinoGrande, etc.) and suffers a negligible 2.4% perplexity hit.

However, symmetric variants like Q=K=V fail catastrophically in causal LM.

1d · Views 380 · Likes 8 · Bookmarks 1
Ayoub @AyGhriTweets

@che_shr_cat I thought the full attention in Gemma 4 models already does that (k_proj = v_proj), no?

22h · Views 134 · Likes 1 · Bookmarks 2
Grigory Sapunov @che_shr_cat

8/ Fascinating math detail: The authors show that under complete QKV collapse (Q=K=V), linear kernelized attention mathematically reduces to a recurrent State-Space Model (SSM) with adaptive, input-conditioned updates.

An elegant bridge between attention and SSMs.

1d · Views 330 · Likes 10
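The attention-to-recurrence link in 8/ can be illustrated with the generic linear-attention recurrence. This is a sketch under assumptions, not the paper's derivation: it uses the common `elu(z) + 1` feature map, a single head, and a tied representation `z` standing in for Q = K = V. It checks numerically that the all-pairs causal form and a per-token running-state form give the same outputs.

```python
import numpy as np


def phi(z):
    """Positive feature map elu(z) + 1, a common choice in linear attention."""
    return np.where(z > 0, z + 1.0, np.exp(z))


def parallel_form(z):
    """Causal linear attention with Q = K = V = z, computed over all pairs."""
    f = phi(z)
    weights = np.tril(f @ f.T)               # keep only positions s <= t
    return (weights @ z) / weights.sum(axis=1, keepdims=True)


def recurrent_form(z):
    """The same computation as a running state updated once per token."""
    d = z.shape[1]
    state = np.zeros((d, d))                 # running sum of outer(phi(z_s), z_s)
    norm = np.zeros(d)                       # running sum of phi(z_s)
    outputs = []
    for z_t in z:
        f_t = phi(z_t)
        state += np.outer(f_t, z_t)          # update depends on the current input
        norm += f_t
        outputs.append((f_t @ state) / (f_t @ norm))
    return np.stack(outputs)


rng = np.random.default_rng(0)
z = rng.standard_normal((6, 4))              # 6 tokens, 4 features
forms_match = np.allclose(parallel_form(z), recurrent_form(z))
print(forms_match)
```

The recurrent form needs only a fixed-size state per layer regardless of sequence length, which is the property it shares with state-space models.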
Grigory Sapunov @che_shr_cat

2/ A new paper, "Do Transformers Need Three Projections?", systematically dismantles this bottleneck.

Ali Kayyam, Anusha Madan Gopal, and M Anthony Lewis prove that sharing QKV projections is viable and highly effective.

1d · Views 530 · Likes 9
Grigory Sapunov @che_shr_cat

3/ Standard multi-head attention projects Q, K, and V independently.

The authors analyzed trained weights and found high redundancy: Key and Value projection spaces have a 0.73 cosine similarity. Query is distinct (0.42).

This justifies a new variant: Q-K=V.

1d · Views 513 · Likes 9
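The thread does not say how the similarity in 3/ was measured. As one possible reading, the sketch below computes cosine similarity between flattened weight matrices. Both the metric and the `weight_cosine` helper are assumptions, and the matrices are synthetic stand-ins (the "value" matrix is deliberately built from the "key" matrix plus noise), so the printed numbers say nothing about the paper's 0.73 and 0.42.

```python
import numpy as np


def weight_cosine(a, b):
    """Cosine similarity between two weight matrices, treated as flat vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Synthetic stand-ins for trained projections (NOT real model weights).
rng = np.random.default_rng(0)
d_model = 64
w_k = rng.standard_normal((d_model, d_model))
w_v = w_k + 0.5 * rng.standard_normal((d_model, d_model))  # correlated with w_k
w_q = rng.standard_normal((d_model, d_model))              # independent of w_k

sim_kv = weight_cosine(w_k, w_v)
sim_kq = weight_cosine(w_k, w_q)
print(f"K vs V: {sim_kv:.2f}, K vs Q: {sim_kq:.2f}")
```

On a real checkpoint the same function could be applied per layer to the key, value, and query weight tensors, though a subspace-based measure might be closer to what "projection spaces" means in the post.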
Grigory Sapunov @che_shr_cat

5/ This is not a replacement for Grouped Query Attention (GQA) or Multi-Query Attention (MQA). It is orthogonal.

Combining Q-K=V with MQA (Q-MQA) yields an astonishing 96.9% reduction in KV cache memory, creating a massive shift in the serving Pareto frontier.

1d · Views 444 · Likes 8
Grigory Sapunov @che_shr_cat

11/ I also illustrated this architectural shift as a comic to make the mechanics intuitive. Check it out below!

1d · Views 361 · Likes 4
Grigory Sapunov @che_shr_cat

9/ I think this is a highly underrated architectural shift. As long-context and edge-device serving dominate priorities, we can't afford redundant weights.

Tying Key and Value projections directly in training is a simple, mathematically sound win.

1d · Views 363 · Likes 6
Grigory Sapunov @che_shr_cat

4/ In the Q-K=V configuration, we project inputs into Q and a unified K space.

During autoregressive decoding, you only store the single unified K tensor in the KV cache. This instantly cuts your KV cache memory footprint by 50% with zero decompression overhead.

1d · Views 439 · Likes 5
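A minimal decoding sketch of the single-tensor cache described in 4/, assuming one head, NumPy arrays, and random weights. `TiedKVCache` and `decode_step` are hypothetical names, not taken from the paper's repository. Each step appends one projected vector and uses the cached tensor both to score against the query and as the content that is mixed.

```python
import numpy as np


class TiedKVCache:
    """Per-layer cache holding one tensor that serves as both key and value."""

    def __init__(self, head_dim):
        self.kv = np.zeros((0, head_dim))

    def append(self, kv_t):
        self.kv = np.vstack([self.kv, kv_t[None, :]])


def decode_step(x_t, w_q, w_kv, cache):
    """One autoregressive step of single-head attention with K = V."""
    q_t = x_t @ w_q
    cache.append(x_t @ w_kv)                     # one projection, stored once
    scores = cache.kv @ q_t / np.sqrt(q_t.shape[-1])
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights @ cache.kv                    # cached tensor reused as values


rng = np.random.default_rng(0)
seq_len, d_model, head_dim = 4, 8, 8
x = rng.standard_normal((seq_len, d_model))
w_q = rng.standard_normal((d_model, head_dim))
w_kv = rng.standard_normal((d_model, head_dim))

cache = TiedKVCache(head_dim)
outputs = [decode_step(x_t, w_q, w_kv, cache) for x_t in x]

tied_floats = cache.kv.size                      # seq_len * head_dim
standard_floats = 2 * seq_len * head_dim         # separate K and V tensors
print(tied_floats, standard_floats)              # 32 64
```

Because the past tokens are only ever attended to (their own queries are not needed again), the cache is causal by construction and no mask is required in this loop.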
Grigory Sapunov @che_shr_cat

7/ Before rewriting your training loops, here are the caveats:
• Max scale tested is 1.2B params.
• Evaluated up to 2k context length.
• To get real-world speedups, we need custom CUDA kernels, as optimized setups (like FlashAttention) expect three distinct QKV tensors.

1d · Views 355 · Likes 5
Guilherme O'Tina @guilhermeotina

@che_shr_cat worth noting this is from brainchip. they make the akida neuromorphic chip where skipping a projection means literally zero energy for those ops. the 50% kv cache cut is nice on gpu, but on event hardware the savings are a different league

21h · Views 104 · Likes 3
Laura Young @lauraxy0ung

@che_shr_cat Yes🙌

1d · Views 197
Shinka - AI @ShinkaIoT

@rohanpaul_ai Halving KV cache for a minimal perplexity trade-off is a massive win for practical, scalable inference.

1d · Views 43 · Likes 1

Quick warning before anyone implements this: Q-K=V does not mean Q minus K equals V. It means K equals V, with Q kept separate, and a chunk of arXiv readers spent an hour confused by the same notation. The result is real and the 50 percent KV cut holds, but pay attention to which variant you actually pick, because the same minus sign reads three different ways across the paper. Read twice, code once.

1d · Views 37
lsm_ @thisispiyushK

@che_shr_cat Will read the paper but seems like it is forming similar intuition as this blog -

20h · Views 8
Brandon @brandon_xyzw

I basically deleted the Q or K transformation, I can't remember, in modded-nanogpt, and it barely affected loss as well. I did this because I didn't trust what I read about the theory behind QKV.

Because in theory, can't you compose two transformations in one linear matrix of the same size? I.e. if one is an identity, isn't that fine because they're supposed to be relative to one another anyway?

22h · Views 7
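The composition argument above can be checked numerically for the attention logits. The sketch assumes a single head with square projections and no positional encoding between projection and dot product; the variable names are illustrative. Since (x W_q)(x W_k)^T = x (W_q W_k^T) x^T, the two maps fold into one and the key side can be left as the raw input.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 4, 6
x = rng.standard_normal((seq_len, d))
w_q = rng.standard_normal((d, d))
w_k = rng.standard_normal((d, d))

# Standard logits with separate query and key projections.
scores_two_maps = (x @ w_q) @ (x @ w_k).T

# Fold the key projection into the query side; keys become the raw input.
w_merged = w_q @ w_k.T
scores_one_map = (x @ w_merged) @ x.T

logits_match = np.allclose(scores_two_maps, scores_one_map)
print(logits_match)
```

One caveat: with multiple heads, each per-head projection is d_model × d_head with d_head smaller than d_model, so the merged matrix is d_model × d_model with rank at most d_head. It is therefore not "the same size" as either factor, and the equivalence covers only the logits, not the value path.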
Mnemosyne @mnemosyne_oss

@rohanpaul_ai Great thread. If you are into this space, Mnemosyne is an open-source memory provider with hybrid semantic/temporal search for AI agents. Runs locally, built for Hermes Agent. @mnemosyne_oss

1d · Views 1 · Likes 1