Some of the more puzzling unpublished observations from our paper: deep attention layers hate the residual stream as input for V and love it for Q and K, but if they have to make a choice, they will satisfy V over QK.
Translated into findings: if we learn coefficients for the residual stream xi and the initial token embedding x0 as two input streams to deep attention layers, the model gives the coefficient for x0 a much larger magnitude. The input is then dominated by context-free token information.
However, if we learn the coefficients at a finer granularity, separately for Q, K, and V, the coefficients for x0 are near 0 for both Q and K, but huge for V.
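To make the two setups concrete, here is a minimal dependency-free sketch of how the inputs to Q, K, and V could be formed in each case. It is an illustration under stated assumptions, not the paper's implementation: the names `mix`, `SharedMix`, and `PerProjectionMix` are hypothetical, the coefficients are plain floats rather than trained parameters, and the numbers in the usage note are made up to mirror the pattern described above (x0 coefficient near 0 for Q and K, large for V).

```python
from dataclasses import dataclass

Vector = list[float]


def mix(a: float, b: float, x_i: Vector, x_0: Vector) -> Vector:
    """Return a * x_i + b * x_0 elementwise (residual stream plus token embedding)."""
    return [a * r + b * e for r, e in zip(x_i, x_0)]


@dataclass
class SharedMix:
    """Coarse setup: one (a, b) pair shared by Q, K, and V (hypothetical name)."""
    a: float = 1.0  # coefficient on the residual stream x_i
    b: float = 0.0  # coefficient on the initial token embedding x_0

    def inputs(self, x_i: Vector, x_0: Vector) -> dict[str, Vector]:
        h = mix(self.a, self.b, x_i, x_0)
        # Q, K, and V all see the same mixed input.
        return {"q": h, "k": h, "v": h}


@dataclass
class PerProjectionMix:
    """Fine-grained setup: a separate (a, b) pair per projection (hypothetical name)."""
    a_q: float = 1.0
    b_q: float = 0.0
    a_k: float = 1.0
    b_k: float = 0.0
    a_v: float = 1.0
    b_v: float = 0.0

    def inputs(self, x_i: Vector, x_0: Vector) -> dict[str, Vector]:
        return {
            "q": mix(self.a_q, self.b_q, x_i, x_0),
            "k": mix(self.a_k, self.b_k, x_i, x_0),
            "v": mix(self.a_v, self.b_v, x_i, x_0),
        }
```

For example, `PerProjectionMix(b_q=0.0, b_k=0.0, b_v=5.0).inputs(x_i, x_0)` leaves the Q and K inputs equal to `x_i` while the V input is dominated by `x_0`. `SharedMix(b=5.0)` cannot do this: it forces the same `x_0`-heavy input onto all three projections.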
This reveals two surprises. (1) Q and K need context information and little original token information, so K does not need the same information as V does (despite some models tying them). (2) Between the two opposing needs, the model clearly favors what benefits V, which suggests V matters more to the optimization objective.
These are just the tip of the iceberg, and transformers surely move in mysterious ways. We will therefore embark on the second part of this journey and, for our next set of experiments, involve this lady (iykyk)...
Paper: https://github.com/RiddleHe/nanochat/blob/master/papers/bank_of_values.pdf