Tech · 16h ago

Merging Transformer QKV projections cuts KV cache memory by 50% with a 3.1% perplexity increase

Combining this with MQA reduces KV cache by 96.9% (87.5% with GQA).

#847

Original post

Rohan Paul @rohanpaul_ai · #847 in Tech

Interesting, this paper shows that Transformers may not need separate key and value projections to work well.

This paper's design cut the KV cache by 50% in language modeling with only 3.1% higher perplexity, meaning inference memory fell sharply while prediction quality stayed close.

A normal attention layer computes a Query to ask what each token needs, a Key to label what each token offers, and a Value to carry the information sent back.

Here, the surprising result is that Key and Value can often share the same learned map, because the model can use one representation both as an address and as the content being retrieved.

The best variant, Q-K=V, kept Query separate, so attention still had direction: one token can ask a different token for information instead of every relation becoming mirror-like.
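A minimal sketch of the Q-K=V idea described above, in plain Python: one projection produces a tensor that is used both to score against the queries and as the content that gets mixed. The single-head layout, the sizes, and every name here (`shared_kv_attention`, `w_kv`) are assumptions for illustration, not the paper's code.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def shared_kv_attention(x, w_q, w_kv):
    """Causal single-head attention in which one projection serves as both K and V."""
    q = matmul(x, w_q)
    kv = matmul(x, w_kv)  # the same tensor is used as keys and as values
    d_head = len(q[0])
    outputs = []
    for t, q_t in enumerate(q):
        # Causal mask: position t only attends to positions 0..t.
        scores = [
            sum(a * b for a, b in zip(q_t, kv[s])) / math.sqrt(d_head)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * kv[s][j] for s, w in enumerate(weights)) for j in range(d_head)]
        )
    return outputs, kv


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model, d_head = 5, 8, 4  # illustrative sizes, not from the paper
x = random_matrix(seq_len, d_model, rng)
w_q = random_matrix(d_model, d_head, rng)
w_kv = random_matrix(d_model, d_head, rng)
outputs, kv = shared_kv_attention(x, w_q, w_kv)
print(len(outputs), len(outputs[0]))
```

Because the query projection stays separate, the score between token i and token j generally differs from the score between j and i, which is the "direction" the post refers to.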

When stacked with GQA and MQA, the same idea reached 87.5% and 96.9% cache cuts respectively, because it stores one tensor per cached head instead of two while those methods reduce the number of cached heads.
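The three percentages can be reproduced with simple counting. The post does not give head counts, so the sketch below assumes 16 query heads and 4 KV heads for GQA, which is one configuration consistent with the quoted figures; the head dimension is also an assumption and cancels out of the ratio.

```python
def kv_cache_floats(num_kv_heads, head_dim, shared_kv):
    """Floats cached per token per layer."""
    tensors = 1 if shared_kv else 2  # K=V stores one tensor instead of two
    return tensors * num_kv_heads * head_dim


def reduction_percent(baseline, variant):
    return 100.0 * (1.0 - variant / baseline)


NUM_HEADS = 16     # assumption: not stated in the post
GQA_KV_HEADS = 4   # assumption: not stated in the post
HEAD_DIM = 64      # assumption: cancels out of every ratio

# Baseline: standard multi-head attention caching separate K and V for every head.
baseline = kv_cache_floats(NUM_HEADS, HEAD_DIM, shared_kv=False)

reductions = {
    "Q-K=V": reduction_percent(
        baseline, kv_cache_floats(NUM_HEADS, HEAD_DIM, shared_kv=True)
    ),
    "Q-K=V + GQA": reduction_percent(
        baseline, kv_cache_floats(GQA_KV_HEADS, HEAD_DIM, shared_kv=True)
    ),
    "Q-K=V + MQA": reduction_percent(
        baseline, kv_cache_floats(1, HEAD_DIM, shared_kv=True)
    ),
}
for name, percent in reductions.items():
    print(f"{name}: {percent:.1f}%")
```

Under these assumptions the two savings multiply: halving the tensors per head and dividing the number of cached heads are independent factors.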

The weak variant is Q=K-V, because tying Query and Key makes attention too symmetric for causal language modeling, and it gives no KV-cache savings since Key and Value still have to be cached as separate tensors.
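The symmetry claim can be checked numerically. This is a small sketch with made-up sizes and random weights (all names are hypothetical): when Query and Key share one projection, the pre-softmax score matrix equals its own transpose, while a separate Query projection breaks that. It only looks at raw scores; row-wise softmax and the causal mask are left out.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def is_symmetric(m, tol=1e-9):
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
x = random_matrix(4, 6, rng)          # 4 tokens, model width 6 (illustrative)
w_shared = random_matrix(6, 3, rng)   # used for both Q and K in the tied case
w_q = random_matrix(6, 3, rng)        # separate query projection

qk = matmul(x, w_shared)
tied_scores = matmul(qk, transpose(qk))                  # Q = K
untied_scores = matmul(matmul(x, w_q), transpose(qk))    # Q separate from K
print(is_symmetric(tied_scores), is_symmetric(untied_scores))
```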

----

Link: arxiv.org/abs/2606.04032v2

Title: "Do Transformers Need Three Projections? Systematic Study of QKV Variants"

4:39 AM · Jun 9, 2026 · 2.5K Views

Sentiment

Users praise merging QKV projections in transformers because it halves KV cache size with minimal trade-offs, enabling more scalable and efficient inference.

Pos

100.0%

Neg

0.0%

4 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 10.8K · Bookmarks 150 · Likes 144 · Retweets 22 · Replies 4

Grigory Sapunov@che_shr_cat

1/ We have spent years optimizing KV cache via head-sharing (GQA/MQA), but we ignored a fundamental assumption: why do Transformers need three separate Q, K, and V projections in the first place?

Turns out, they don't. Merging them unlocks massive memory savings. 🧵

10h ago

Grigory Sapunov@che_shr_cat

10/ Read the full paper and code here:

Paper: https://arxiv.org/abs/2606.04032
Code: https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections

Read my complete technical breakdown: https://arxiviq.substack.com/p/do-transformers-need-three-projections

What are your thoughts on collapsing the QKV space?

10h ago

Grigory Sapunov@che_shr_cat

6/ Does this break the model? Hardly.

At 1.2B scale, the Q-K=V model loses only 0.41% average downstream accuracy (HellaSwag, WinoGrande, etc.) and suffers a negligible 2.4% perplexity hit.

However, symmetric variants like Q=K=V fail catastrophically in causal LM.

10h ago

Ayoub@AyGhriTweets

@che_shr_cat I thought the full attention in Gemma 4 models already do that (k_proj = v_proj), no?

6h ago

Grigory Sapunov@che_shr_cat

8/ Fascinating math detail: The authors show that under complete QKV collapse (Q=K=V), linear kernelized attention mathematically reduces to a recurrent State-Space Model (SSM) with adaptive, input-conditioned updates.

An elegant bridge between attention and SSMs.
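The thread does not reproduce the paper's derivation, so the following is only a sketch of the standard linear-attention recurrence specialised to Q = K = V. It assumes unnormalized causal linear attention with the feature map elu(x) + 1, treats `z` as the already-projected tokens, and checks that the attention-style double loop and a running-state update give the same outputs.

```python
import math
import random


def phi(v):
    """Feature map elu(x) + 1, which keeps every entry positive."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_form(z):
    """Causal, unnormalized linear attention with Q = K = V = z."""
    d = len(z[0])
    outputs = []
    for t in range(len(z)):
        acc = [0.0] * d
        for s in range(t + 1):
            weight = dot(phi(z[t]), phi(z[s]))
            acc = [a + weight * v for a, v in zip(acc, z[s])]
        outputs.append(acc)
    return outputs


def recurrent_form(z):
    """Same computation as a running d x d state updated once per token."""
    d = len(z[0])
    state = [[0.0] * d for _ in range(d)]
    outputs = []
    for z_t in z:
        f = phi(z_t)
        for i in range(d):
            for j in range(d):
                state[i][j] += f[i] * z_t[j]  # the update is built from the input itself
        outputs.append([sum(f[i] * state[i][j] for i in range(d)) for j in range(d)])
    return outputs


rng = random.Random(0)
z = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(6)]  # hypothetical inputs
max_gap = max(
    abs(a - b)
    for row_a, row_b in zip(attention_form(z), recurrent_form(z))
    for a, b in zip(row_a, row_b)
)
print(max_gap < 1e-9)
```

The recurrent form keeps a fixed-size state instead of a cache that grows with sequence length, which is the sense in which it resembles an SSM; the normalization term used in practice is omitted here for brevity.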

10h ago

Grigory Sapunov@che_shr_cat

2/ A new paper, "Do Transformers Need Three Projections?", systematically dismantles this bottleneck.

Ali Kayyam, Anusha Madan Gopal, and M Anthony Lewis show that sharing QKV projections is viable and highly effective.

10h ago

Grigory Sapunov@che_shr_cat

3/ Standard multi-head attention projects Q, K, and V independently.

The authors analyzed trained weights and found high redundancy: Key and Value projection spaces have a 0.73 cosine similarity. Query is distinct (0.42).

This justifies a new variant: Q-K=V.
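The thread does not say how the similarity between projection spaces was measured. One simple possibility, shown here purely as an assumption, is the cosine similarity of the flattened weight matrices; the tiny matrices below are made up and the paper's actual metric may differ.

```python
import math


def cosine_similarity(w_a, w_b):
    """Cosine similarity between two equally shaped matrices, treated as flat vectors."""
    flat_a = [v for row in w_a for v in row]
    flat_b = [v for row in w_b for v in row]
    dot = sum(a * b for a, b in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(a * a for a in flat_a))
    norm_b = math.sqrt(sum(b * b for b in flat_b))
    return dot / (norm_a * norm_b)


# Hypothetical 2 x 2 "trained" projections, for illustration only.
w_k = [[1.0, 2.0], [0.0, 1.0]]
w_v = [[1.0, 1.5], [0.5, 1.0]]
similarity = cosine_similarity(w_k, w_v)
print(round(similarity, 2))
```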

10h ago

Grigory Sapunov@che_shr_cat

5/ This is not a replacement for Grouped Query Attention (GQA) or Multi-Query Attention (MQA). It is orthogonal.

Combining Q-K=V with MQA (Q-MQA) yields an astonishing 96.9% reduction in KV cache memory, creating a massive shift in the serving Pareto frontier.

10h ago

Grigory Sapunov@che_shr_cat

11/ I also illustrated this architectural shift as a comic to make the mechanics intuitive. Check it out below!

10h ago

Grigory Sapunov@che_shr_cat

9/ I think this is a highly underrated architectural shift. As long-context and edge-device serving dominate priorities, we can't afford redundant weights.

Tying Key and Value projections directly in training is a simple, mathematically sound win.

10h ago

Grigory Sapunov@che_shr_cat

4/ In the Q-K=V configuration, we project inputs into Q and a unified K space.

During autoregressive decoding, you only store the single unified K tensor in the KV cache. This instantly cuts your KV cache memory footprint by 50% with zero decompression overhead.
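The decoding step described above can be sketched as a cache that appends one vector per token and reads it back twice, once as a key and once as a value. `SharedKVCache` and the hand-written step vectors are hypothetical; a real implementation would hold batched tensors per layer and per head.

```python
import math


class SharedKVCache:
    """Single-head, single-layer cache holding one vector per token (K and V are the same)."""

    def __init__(self):
        self.kv = []

    def append(self, kv_t):
        self.kv.append(list(kv_t))

    def attend(self, q_t):
        """Attend from the newest query to every cached token."""
        scale = math.sqrt(len(q_t))
        scores = [sum(a * b for a, b in zip(q_t, kv_s)) / scale for kv_s in self.kv]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        # The cached vectors are read twice: as keys above, as values here.
        return [
            sum(e / total * kv_s[j] for e, kv_s in zip(exps, self.kv))
            for j in range(len(q_t))
        ]

    def num_floats(self):
        return sum(len(v) for v in self.kv)


cache = SharedKVCache()
steps = [  # hypothetical already-projected (query, key/value) pairs for three decode steps
    ([1.0, 0.0], [0.5, -0.5]),
    ([0.0, 1.0], [1.0, 2.0]),
    ([1.0, 1.0], [-1.0, 0.0]),
]
outputs = []
for q_t, kv_t in steps:
    cache.append(kv_t)
    outputs.append(cache.attend(q_t))

separate_kv_floats = 2 * cache.num_floats()  # a standard cache would store K and V separately
print(cache.num_floats(), separate_kv_floats)
```

Nothing is reconstructed at read time, which matches the "zero decompression overhead" point: the saving comes from never storing a second tensor.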

10h ago

Grigory Sapunov@che_shr_cat

7/ Before rewriting your training loops, here are the caveats:
• Max scale tested is 1.2B params.
• Evaluated up to 2k context length.
• To get real-world speedups, we need custom CUDA kernels, as optimized setups (like FlashAttention) expect three distinct QKV tensors.

10h ago

Guilherme O'Tina@guilhermeotina

@che_shr_cat worth noting this is from brainchip. they make the akida neuromorphic chip where skipping a projection means literally zero energy for those ops. the 50% kv cache cut is nice on gpu, but on event hardware the savings are a different league

5h ago

Laura Young@lauraxy0ung

@che_shr_cat Yes🙌

10h ago

Shinka - AI@ShinkaIoT

@rohanpaul_ai Halving KV cache for a minimal perplexity trade-off is a massive win for practical, scalable inference.

16h ago

Martin Szerment | Practical AI@MartinSzerment

Quick warning before anyone implements this, Q-K=V does not mean Q minus K equals V. It means K equals V, with Q kept separate, and a chunk of arXiv readers spent an hour confused by the same notation. The result is real and the 50 percent KV cut holds, but pay attention to which variant you actually pick because the same minus sign reads three different ways across the paper. Read twice, code once.

15h ago

lsm_@thisispiyushK

@che_shr_cat Will read the paper but seems like it is forming similar intuition as this blog -

5h ago

Brandon@brandon_xyzw

I basically deleted the Q or K transformation, I can't remember, in modded-nanogpt, and it barely affected loss as well. I did this because I didn't trust what I read about the theory behind QKV.

Because in theory, can't you compose two transformations in one linear matrix of the same size? I.e. if one is an identity, isn't that fine because they're supposed to be relative to one another anyway?
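The composition argument can be checked for the simplest case. This sketch assumes a single head whose projections are square (head width equal to model width), with random made-up weights: the attention scores only depend on the product of the Query map and the transposed Key map, so folding the Key map into the Query map and leaving the keys unprojected gives the same scores.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d = 4                           # model width and head width (assumed equal)
x = random_matrix(3, d, rng)    # 3 tokens
w_q = random_matrix(d, d, rng)
w_k = random_matrix(d, d, rng)

# Two separate maps: scores = (x W_q)(x W_k)^T
two_map_scores = matmul(matmul(x, w_q), transpose(matmul(x, w_k)))

# One merged map with identity keys: scores = (x W_q W_k^T) x^T
w_merged = matmul(w_q, transpose(w_k))
one_map_scores = matmul(matmul(x, w_merged), transpose(x))

max_gap = max(
    abs(a - b)
    for row_a, row_b in zip(two_map_scores, one_map_scores)
    for a, b in zip(row_a, row_b)
)
print(max_gap < 1e-9)
```

With a head width smaller than the model width the merged matrix is a low-rank model-width square, so the equivalence no longer fits in a matrix of the original per-head shape; the check above says nothing about training dynamics either.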

6h ago

Mnemosyne@mnemosyne_oss

@rohanpaul_ai Great thread. If you are into this space, Mnemosyne is an open-source memory provider with hybrid semantic/temporal search for AI agents. Runs locally, built for Hermes Agent. @mnemosyne_oss

16h ago