Interesting, this paper shows that Transformers may not need separate key and value projections to work well.
This paper's design cut the KV cache by 50% in language modeling with only 3.1% higher perplexity, meaning inference memory fell sharply while prediction quality stayed close.
A normal attention layer uses a Query to ask what each token needs, a Key to label what each token offers, and a Value to carry the information sent back.
Here, the surprising result is that Key and Value can often share the same learned map, because the model can use one representation both as an address and as the content being retrieved.
The best variant, Q-K=V, kept Query separate, so attention still had direction: one token can ask a different token for information instead of every relation becoming mirror-like.
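To make the Q-K=V idea concrete, here is a minimal single-head sketch in NumPy. It is an illustration under stated assumptions, not the paper's implementation: the function name `shared_kv_attention`, the weight names `w_q` and `w_kv`, and the toy sizes are all hypothetical.

```python
import numpy as np

def shared_kv_attention(x, w_q, w_kv):
    """Single-head causal attention where Key and Value share one projection.

    Hypothetical sketch: names and shapes are assumptions, not from the paper.
    """
    q = x @ w_q    # queries keep their own projection
    kv = x @ w_kv  # one tensor acts as both the address (key) and the content (value)
    d = q.shape[-1]
    scores = (q @ kv.T) / np.sqrt(d)

    # Causal mask: position i may only attend to positions <= i.
    t = x.shape[0]
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    # Only `kv` has to be cached per token, instead of separate K and V tensors.
    return weights @ kv, kv

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))      # 5 tokens, model width 8 (toy sizes)
w_q = rng.normal(size=(8, 4))
w_kv = rng.normal(size=(8, 4))
out, cache = shared_kv_attention(x, w_q, w_kv)
```

Because Query has its own weights, the score from token i to token j generally differs from the score from j to i, which is the directionality described above.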
When stacked with GQA and MQA, the same idea reached 87.5% and 96.9% cache cuts, because sharing Key and Value halves what is cached per head while those methods reduce the number of heads that are cached.
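The stacked percentages follow from simple counting. The sketch below assumes 16 query heads and, for GQA, 4 key/value heads; those head counts are my assumption, chosen because they reproduce the quoted figures, and are not stated in this summary. The helper `kv_cache_reduction` is hypothetical.

```python
def kv_cache_reduction(num_heads, num_kv_heads, shared_kv):
    """Fraction of the KV cache saved relative to standard multi-head attention.

    Baseline caches 2 tensors (K and V) for each of `num_heads` heads.
    """
    baseline = 2 * num_heads
    tensors_per_head = 1 if shared_kv else 2
    stored = tensors_per_head * num_kv_heads
    return 1 - stored / baseline

# Head counts below are illustrative assumptions.
shared_only = kv_cache_reduction(16, 16, shared_kv=True)  # 0.5
with_gqa = kv_cache_reduction(16, 4, shared_kv=True)      # 0.875
with_mqa = kv_cache_reduction(16, 1, shared_kv=True)      # 0.96875, quoted as 96.9%
```

The two savings multiply: the shared projection halves the tensors per head, and GQA or MQA cuts the number of cached heads.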
The weak variant is Q=K-V: tying Query and Key makes the attention scores symmetric, which fits causal language modeling poorly, and because Key and Value are still stored separately it gives no KV-cache savings.
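The symmetry problem can be checked directly on the pre-softmax scores. This is a small NumPy sketch with random toy weights; the names `w_qk` and `w_q` are hypothetical and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))      # 5 tokens, model width 8 (toy sizes)
w_qk = rng.normal(size=(8, 4))   # shared Query/Key weights (Q=K-V style)
w_q = rng.normal(size=(8, 4))    # separate Query weights (Q-K=V style)

# Tied Query and Key: scores[i, j] always equals scores[j, i].
tied_scores = (x @ w_qk) @ (x @ w_qk).T
# Separate Query: token i scoring j can differ from j scoring i.
untied_scores = (x @ w_q) @ (x @ w_qk).T

tied_is_symmetric = np.allclose(tied_scores, tied_scores.T)
untied_is_symmetric = np.allclose(untied_scores, untied_scores.T)
```

With tied weights the raw score matrix is symmetric by construction, so any asymmetry has to come from the causal mask and softmax normalization rather than from what the model learned.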
----
Link – arxiv.org/abs/2606.04032v2
Title: "Do Transformers Need Three Projections? Systematic Study of QKV Variants"








