Transformer Attention Layers Prioritize Value Residual Stream Over Query-Key

Songlin Yang

Muyu He (@HeMuyu0327)

Some of the more puzzling unpublished observations from our paper: deep attention layers hate the residual stream for V and love it for QK, but if they have to make a choice, they will satisfy V over QK.

Translated into findings: if we learn coefficients for the residual stream xi and the initial token embedding x0 as two input streams to deep attention layers, the model gives the coefficient for x0 a much larger magnitude. This means the input is dominated by context-free token information.
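One way to write down the shared-coefficient setup described above, as a sketch only: the symbols a_i, b_i, and the mixed input \tilde{x}_i are assumptions introduced here for illustration, and the post does not say whether the coefficients are scalars or per-channel vectors.

```latex
% Assumed notation: x_i is the residual stream entering layer i,
% x_0 is the initial token embedding, a_i and b_i are learned coefficients.
\tilde{x}_i = a_i \, x_i + b_i \, x_0,
\qquad \text{with the reported outcome } |b_i| \gg |a_i| \text{ in deep layers.}
```

Under this reading, Q, K, and V are all computed from the same \tilde{x}_i, so a large b_i pushes context-free token information into all three at once.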

However, if we learn the coefficients for both streams at a more fine-grained level, separately for Q, K, and V, the coefficients for x0 are near 0 for both Q and K, but huge for V.
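The per-projection variant can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `StreamMix`, `qkv_inputs`, the use of scalar coefficients, and every numeric value below are hypothetical, chosen only to mirror the qualitative pattern described (x0 weight near 0 for Q and K, large for V).

```python
from dataclasses import dataclass


@dataclass
class StreamMix:
    """Hypothetical pair of learned scalars for one projection."""
    resid: float  # weight on the residual stream x_i
    embed: float  # weight on the initial token embedding x_0

    def apply(self, x_i, x_0):
        # Elementwise weighted sum of the two input streams.
        return [self.resid * r + self.embed * e for r, e in zip(x_i, x_0)]


def qkv_inputs(x_i, x_0, mixes):
    """Build a separate mixed input for each of Q, K, and V."""
    return {name: mix.apply(x_i, x_0) for name, mix in mixes.items()}


# Illustrative values only (assumptions, not results from the paper):
# Q and K ignore x_0, while V is dominated by it.
mixes = {
    "q": StreamMix(resid=1.0, embed=0.0),
    "k": StreamMix(resid=1.0, embed=0.0),
    "v": StreamMix(resid=0.1, embed=2.0),
}

x_i = [0.5, -1.0, 2.0]   # toy residual-stream vector for one token
x_0 = [1.0, 1.0, -1.0]   # toy initial embedding for the same token

inputs = qkv_inputs(x_i, x_0, mixes)
```

In a real model each mixed vector would then pass through its own projection matrix; the sketch stops at the mixing step because that is the only part the post describes.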

This reveals two surprises. (1) QK needs context information and little original token information. And K does not need the same information as V does (despite some models tying them). (2) Between the two opposite needs, the model is clearly in favor of what benefits V, so V is deemed more important to the optimization goal.

These are just the tip of the iceberg, and transformers surely move in mysterious ways. We will therefore embark on the second part of this journey and, for our next set of experiments, involve this lady (iykyk)...

Paper: https://github.com/RiddleHe/nanochat/blob/master/papers/bank_of_values.pdf

9:38 PM · Jun 1, 2026 · 4.7K Views
