
Transformer Study Shows Value Vectors Read Original Tokens in Deep Layers

Original post · Songlin Yang · #235
Wuxxcc (@YuchenL52766559)

A quick follow-up to our paper. We found that in the deep layers of a transformer, the value vectors don't need the residual stream; the token's own identity is enough to produce them. The question we kept coming back to was whether the same holds for the queries and keys, or only for the values.

So we ran a small experiment. We took an attention residual model and let q, k, and v each learn, on their own and per layer, which earlier layer to read from. Then we just looked at what each one picked.
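To make the setup concrete, here is a minimal sketch of what "each of q, k, and v learns which earlier layer to read from" could look like. It is an assumption about the mechanism, not the code from the paper: the class name `DepthRoutedQKV`, the softmax mixing over earlier sources, and the zero-initialized logits are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

class DepthRoutedQKV:
    """Hypothetical layer: q, k, and v each mix earlier sources with their own weights."""

    NAMES = ("q", "k", "v")

    def __init__(self, n_sources, d_model, rng):
        # One logit per earlier source, kept separately for q, k, and v.
        # In training these would be learned; zeros give a uniform mix.
        self.logits = {n: np.zeros(n_sources) for n in self.NAMES}
        # Ordinary per-projection weight matrices (random here).
        self.proj = {
            n: rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
            for n in self.NAMES
        }

    def mixing_weights(self, name):
        # Distribution over earlier sources for one projection.
        return softmax(self.logits[name])

    def inputs(self, sources):
        # sources has shape (n_sources, seq_len, d_model); index 0 is assumed
        # to be the token embedding, later indices are later layer outputs.
        return {
            n: np.tensordot(self.mixing_weights(n), sources, axes=1)
            for n in self.NAMES
        }

    def qkv(self, sources):
        # Project each routed input with its own matrix.
        x = self.inputs(sources)
        return {n: x[n] @ self.proj[n] for n in self.NAMES}
```

"Looking at what each one picked" then amounts to inspecting `mixing_weights("q")`, `mixing_weights("k")`, and `mixing_weights("v")` for every layer after training.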

They split. In the deep layers the value keeps reading the original token, while the query and key move on to the recent layers, the part of the stream that actually carries context. That makes sense: q and k decide what attends to what, so they need the context. The value is just the content that gets moved, and by then the token alone does the job. It is also why we only swap out the value and leave q and k alone.

One more thing we noticed but aren't sure about yet. When the value does reach back to earlier layers, it mostly skips the attention outputs and reads from the MLP outputs instead. The MLP works on each token on its own, while attention is where tokens get mixed together, so it fits the same picture. The value goes for what a token carries by itself and stays away from where the context comes in. Still early, but we thought it was a fun one.
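One way to check the attention-versus-MLP observation is to sum a projection's mixing weights by the kind of source they point at. The sketch below assumes each earlier source is tagged as the embedding, an attention output, or an MLP output; the helper `weight_by_source_kind` and the tags are hypothetical, and the numbers are made up for illustration, not results from the experiment.

```python
import numpy as np

def weight_by_source_kind(weights, kinds):
    """Sum mixing weights per source kind (e.g. 'embed', 'attn', 'mlp')."""
    totals = {}
    for w, kind in zip(weights, kinds):
        totals[kind] = totals.get(kind, 0.0) + float(w)
    return totals

# Toy values only: one embedding source plus two attention/MLP pairs.
kinds = ["embed", "attn", "mlp", "attn", "mlp"]
toy_value_weights = np.array([0.5, 0.05, 0.2, 0.05, 0.2])

totals = weight_by_source_kind(toy_value_weights, kinds)
```

If the value really avoids where context comes in, the `"attn"` total would stay small compared with `"embed"` and `"mlp"` across the deep layers.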

10:28 PM · Jun 3, 2026 · 6.1K Views