Tech · 1d ago

Research finds merging Transformer projection layers cuts KV cache memory by 50% with minor perplexity loss

Pairing this method with MQA cuts KV cache memory by 96.9%; pairing it with GQA cuts it by 87.5%.

#1088

Original post

Rohan Paul @rohanpaul_ai · #1102 in Tech

Interesting, this paper shows that Transformers may not need separate key and value projections to work well.

This paper's design cut the KV cache by 50% in language modeling with only 3.1% higher perplexity, meaning inference memory fell sharply while prediction quality stayed close.

A normal attention layer makes Query to ask what each token needs, Key to label what each token offers, and Value to carry the information sent back.

Here, the surprising result is that Key and Value can often share the same learned map, because the model can use one representation both as an address and as the content being retrieved.

The best variant, Q-K=V, kept Query separate, so attention still had direction: one token can ask a different token for information instead of every relation becoming mirror-like.

When stacked with GQA and MQA, the same idea reached 87.5% and 96.9% cache cuts respectively, because sharing K and V halves the tensors cached per head while GQA and MQA reduce the number of cached heads.
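The two percentages can be reproduced with simple arithmetic. The sketch below is not from the paper: it assumes a baseline of 16 attention heads, GQA with 4 key/value heads, and MQA with 1, which is one configuration consistent with the quoted numbers. The function names are hypothetical.

```python
from fractions import Fraction


def kv_cache_fraction(num_heads, num_kv_heads, share_kv):
    """Fraction of the standard multi-head KV cache that is still stored."""
    # GQA/MQA keep only num_kv_heads of the num_heads key/value heads.
    head_fraction = Fraction(num_kv_heads, num_heads)
    # Sharing K and V stores one tensor per head instead of two.
    tensor_fraction = Fraction(1, 2) if share_kv else Fraction(1)
    return head_fraction * tensor_fraction


def reduction_percent(num_heads, num_kv_heads, share_kv):
    """KV cache reduction relative to standard multi-head attention, in percent."""
    remaining = kv_cache_fraction(num_heads, num_kv_heads, share_kv)
    return float((1 - remaining) * 100)


NUM_HEADS = 16  # assumed head count, not stated in the post

print(reduction_percent(NUM_HEADS, NUM_HEADS, share_kv=True))  # shared K=V only
print(reduction_percent(NUM_HEADS, 4, share_kv=True))          # shared K=V with GQA (4 KV heads, assumed)
print(reduction_percent(NUM_HEADS, 1, share_kv=True))          # shared K=V with MQA (1 KV head)
```

Under these assumptions the three cases come out to 50%, 87.5%, and 96.875%, the last of which rounds to the 96.9% quoted above.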

The weak variant is Q=K-V, because tying Query and Key makes attention too symmetric for causal language, and it gives no KV-cache savings.
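The symmetry point can be illustrated with a small sketch. It is an illustration rather than the paper's experiment: the weights and token vectors are random, the dimensions are arbitrary, and only the raw attention scores (before the causal mask and softmax) are compared.

```python
import random


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def score_matrix(w_q, w_k, tokens):
    """Raw attention scores: entry [i][j] is query_i dotted with key_j."""
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    return [[sum(a * b for a, b in zip(q, k)) for k in keys] for q in queries]


def is_symmetric(m, tol=1e-9):
    """True when m[i][j] equals m[j][i] for every pair, within tol."""
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))


rng = random.Random(0)
dim = 4  # arbitrary toy dimension


def rand_matrix():
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]


tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
w_q, w_k = rand_matrix(), rand_matrix()

tied = score_matrix(w_k, w_k, tokens)    # Q = K: one map for both roles
untied = score_matrix(w_q, w_k, tokens)  # separate Q and K maps

print(is_symmetric(tied), is_symmetric(untied))
```

With tied weights, token i scores token j exactly as j scores i, so the model cannot express "i needs j but j does not need i". Separate query weights remove that constraint.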

----

Link – arxiv.org/abs/2606.04032v2

Title: "Do Transformers Need Three Projections? Systematic Study of QKV Variants"

4:39 AM · Jun 9, 2026 · 3.9K Views

Sentiment

Positive users hail shared key-value or merged QKV projections in transformers for halving KV cache and memory use with minimal trade-offs, calling it a major practical win for scalable long-context and edge-device inference.

Pos

100.0%

Neg

0.0%

5 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 20.2K · Bookmarks: 263 · Likes: 279 · Retweets: 38 · Replies: 8

Grigory Sapunov@che_shr_cat

1/ We have spent years optimizing KV cache via head-sharing (GQA/MQA), but we ignored a fundamental assumption: why do Transformers need three separate Q, K, and V projections in the first place?

Turns out, they don't. Merging them unlocks massive memory savings. 🧵

1d20.2K279263

Cody Blakeney@code_star

This paper looks very interesting, and it’s a really cool idea.

I gotta say though these results make plain MQA look really compelling.

15h1.4K127

Grigory Sapunov@che_shr_cat

10/ Read the full paper and code here:

Paper: https://arxiv.org/abs/2606.04032 Code: https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections

Read my complete technical breakdown: https://arxiviq.substack.com/p/do-transformers-need-three-projections

What are your thoughts on collapsing the QKV space?

1d445115

Grigory Sapunov@che_shr_cat

6/ Does this break the model? Hardly.

At 1.2B scale, the Q-K=V model loses only 0.41% average downstream accuracy (HellaSwag, WinoGrande, etc.) and suffers a negligible 2.4% perplexity hit.

However, symmetric variants like Q=K=V fail catastrophically in causal LM.

1d38081

Ayoub@AyGhriTweets

@che_shr_cat I thought the full attention in Gemma 4 models already does that (k_proj = v_proj), no?

22h13412

Grigory Sapunov@che_shr_cat

8/ Fascinating math detail: The authors show that under complete QKV collapse (Q=K=V), linear kernelized attention mathematically reduces to a recurrent State-Space Model (SSM) with adaptive, input-conditioned updates.

An elegant bridge between attention and SSMs.
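For readers who want to see the shape of that reduction, here is a sketch using the standard causal linear-attention form. The notation is an assumption and may differ from the paper's: $\phi$ is a generic feature map, $W$ is the single shared projection, and $S_0 = 0$, $n_0 = 0$.

```latex
\begin{aligned}
o_t &= \frac{\sum_{s \le t} \bigl(\phi(q_t)^\top \phi(k_s)\bigr)\, v_s}
            {\sum_{s \le t} \phi(q_t)^\top \phi(k_s)}
  && \text{(causal linear attention)} \\[4pt]
q_t &= k_t = v_t = z_t = W x_t
  && \text{(complete collapse, Q=K=V)} \\[4pt]
S_t &= S_{t-1} + \phi(z_t)\, z_t^\top, \qquad n_t = n_{t-1} + \phi(z_t)
  && \text{(recurrent state update)} \\[4pt]
o_t &= \frac{S_t^\top \phi(z_t)}{n_t^\top \phi(z_t)}
  && \text{(readout from the state)}
\end{aligned}
```

The state $S_t$ is updated by a term that depends only on the current input $z_t$, and the output is read from the state with that same $z_t$, which is the recurrent, input-conditioned structure the post describes.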

1d33010

Grigory Sapunov@che_shr_cat

2/ A new paper, "Do Transformers Need Three Projections?", systematically dismantles this bottleneck.

Ali Kayyam, Anusha Madan Gopal, and M Anthony Lewis prove that sharing QKV projections is viable and highly effective.

1d5309

Grigory Sapunov@che_shr_cat

3/ Standard multi-head attention projects Q, K, and V independently.

The authors analyzed trained weights and found high redundancy: Key and Value projection spaces have a 0.73 cosine similarity. Query is distinct (0.42).

This justifies a new variant: Q-K=V.

1d5139

Grigory Sapunov@che_shr_cat

5/ This is not a replacement for Grouped Query Attention (GQA) or Multi-Query Attention (MQA). It is orthogonal.

Combining Q-K=V with MQA (Q-MQA) yields an astonishing 96.9% reduction in KV cache memory, creating a massive shift in the serving Pareto frontier.

1d4448

Grigory Sapunov@che_shr_cat

11/ I also illustrated this architectural shift as a comic to make the mechanics intuitive. Check it out below!

1d3614

Grigory Sapunov@che_shr_cat

9/ I think this is a highly underrated architectural shift. As long-context and edge-device serving dominate priorities, we can't afford redundant weights.

Tying Key and Value projections directly in training is a simple, mathematically sound win.

1d3636

Grigory Sapunov@che_shr_cat

4/ In the Q-K=V configuration, we project inputs into Q and a unified K space.

During autoregressive decoding, you only store the single unified K tensor in the KV cache. This instantly cuts your KV cache memory footprint by 50% with zero decompression overhead.
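A minimal single-head sketch of that decoding loop, under stated assumptions: the class name `SharedKVCache` and its methods are hypothetical, the weights are plain Python lists, and nothing here is taken from the paper's code. Each step appends one vector per token and uses it both to score against the query and as the content that gets mixed into the output.

```python
import math


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


class SharedKVCache:
    """Single-head decoder cache that stores one vector per token (K = V)."""

    def __init__(self, w_q, w_kv):
        self.w_q = w_q      # query projection, kept separate
        self.w_kv = w_kv    # one shared projection for key and value
        self.entries = []   # one cached vector per token seen so far

    def step(self, x):
        """Process one new token and return its attention output."""
        q = matvec(self.w_q, x)
        self.entries.append(matvec(self.w_kv, x))
        scale = 1.0 / math.sqrt(len(q))
        # The cached vector acts as the key when scoring...
        scores = [scale * sum(a * b for a, b in zip(q, kv)) for kv in self.entries]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # ...and as the value when mixing the output.
        dim = len(self.entries[0])
        return [
            sum(w * kv[i] for w, kv in zip(weights, self.entries)) / total
            for i in range(dim)
        ]

    def cached_floats(self):
        """Number of floats currently held in the cache."""
        return sum(len(kv) for kv in self.entries)


identity = [[1.0, 0.0], [0.0, 1.0]]  # toy weights for illustration only
cache = SharedKVCache(identity, identity)
first = cache.step([1.0, 2.0])
cache.step([0.5, -1.0])
cache.step([2.0, 0.0])
print(first, cache.cached_floats())
```

After three tokens of width 2 the cache holds 6 floats, where a cache with separate K and V tensors would hold 12. Causality comes for free in this loop because the cache only ever contains the current and earlier tokens.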

1d4395

Grigory Sapunov@che_shr_cat

7/ Before rewriting your training loops, here are the caveats: • Max scale tested is 1.2B params. • Evaluated up to 2k context length. • To get real-world speedups, we need custom CUDA kernels, as optimized setups (like FlashAttention) expect three distinct QKV tensors.

1d3555

Guilherme O'Tina@guilhermeotina

@che_shr_cat worth noting this is from brainchip. they make the akida neuromorphic chip where skipping a projection means literally zero energy for those ops. the 50% kv cache cut is nice on gpu, but on event hardware the savings are a different league

21h1043

Laura Young@lauraxy0ung

@che_shr_cat Yes🙌

1d197

Shinka - AI@ShinkaIoT

@rohanpaul_ai Halving KV cache for a minimal perplexity trade-off is a massive win for practical, scalable inference.

1d431

Martin Szerment | Practical AI@MartinSzerment

Quick warning before anyone implements this, Q-K=V does not mean Q minus K equals V. It means K equals V, with Q kept separate, and a chunk of arXiv readers spent an hour confused by the same notation. The result is real and the 50 percent KV cut holds, but pay attention to which variant you actually pick because the same minus sign reads three different ways across the paper. Read twice, code once.

1d37

lsm_@thisispiyushK

@che_shr_cat Will read the paper but seems like it is forming similar intuition as this blog -

20h8

Brandon@brandon_xyzw

I basically deleted the Q or K transformation, I can't remember, in modded-nanogpt, and it barely affected loss as well. I did this because I didn't trust what I read about the theory behind QKV.

Because in theory, can't you compose two transformations in one linear matrix of the same size? I.e. if one is an identity, isn't that fine because they're supposed to be relative to one another anyway?

22h7

Mnemosyne@mnemosyne_oss

@rohanpaul_ai Great thread. If you are into this space, Mnemosyne is an open-source memory provider with hybrid semantic/temporal search for AI agents. Runs locally, built for Hermes Agent. @mnemosyne_oss

1d11