Transformer Study Shows Value Vectors Read Original Tokens in Deep Layers

Original post

Songlin Yang

Wuxxcc (@YuchenL52766559)

A quick follow-up to our paper. We found that in the deep layers of a transformer, the value vectors don't need the residual stream. The token's own identity is enough to produce them. The question we kept coming back to was whether the same holds for the queries and keys, or only for the values.

So we ran a small experiment. We took an attention residual model and let q, k, and v each learn, on their own and per layer, which earlier layer to read from. Then we just looked at what each one picked.
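The setup described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the post does not say how the choice of source layer is parameterized, so the softmax over per-source logits, the names `mix_sources` and `select_logits`, and the random placeholder activations are all assumptions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def mix_sources(sources, logits):
    """Blend earlier-layer outputs using softmax(logits) as mixing weights.

    sources: list of arrays, each of shape (seq_len, d_model)
    logits:  array of shape (len(sources),), one learnable score per source
    """
    weights = softmax(logits)
    stacked = np.stack(sources, axis=0)             # (num_sources, seq_len, d_model)
    mixed = np.tensordot(weights, stacked, axes=1)  # (seq_len, d_model)
    return mixed, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8

# Candidate sources visible to one layer: index 0 stands in for the token
# embedding, the rest for outputs of earlier layers (random placeholders).
sources = [rng.normal(size=(seq_len, d_model)) for _ in range(5)]

# Assumption: q, k and v each own a separate logit vector per layer.
# Zeros correspond to an untrained, uniform mix over all sources.
select_logits = {name: np.zeros(len(sources)) for name in ("q", "k", "v")}

proj_inputs, proj_weights = {}, {}
for name, logits in select_logits.items():
    proj_inputs[name], proj_weights[name] = mix_sources(sources, logits)

# After training, the source each projection favors could be read off like this.
picked = {name: int(np.argmax(w)) for name, w in proj_weights.items()}
```

Keeping three independent logit vectors is what lets q, k, and v diverge in which layer they read from; a single shared vector would force them to agree.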

They split. In the deep layers the value keeps reading the original token, while the query and key move on to the recent layers, the part of the stream that actually carries context. Which makes sense. q and k decide what attends to what, so they need the context. The value is just the content that gets moved, and by then the token alone does the job. It is also why we only swap out the value and leave q and k alone.

One more thing we noticed but aren't sure about yet. When the value does reach back to earlier layers, it mostly skips the attention outputs and reads from the MLP outputs instead. The MLP works on each token on its own, while attention is where tokens get mixed together, so it fits the same picture. The value goes for what a token carries by itself and stays away from where the context comes in. Still early, but we thought it was a fun one.
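One way to check an observation like this is to group a projection's learned mixing weights by the kind of source they point at. The sketch below is hypothetical: `mass_by_kind`, the source labels, and the weight values are made-up placeholders for illustration, not measurements from the experiment.

```python
import numpy as np

def mass_by_kind(weights, kinds):
    """Sum mixing weight per source kind (e.g. 'embed', 'attn', 'mlp')."""
    totals = {}
    for weight, kind in zip(weights, kinds):
        totals[kind] = totals.get(kind, 0.0) + float(weight)
    return totals

# Hypothetical labels for the sources one layer can read from.
kinds = ["embed", "attn", "mlp", "attn", "mlp"]

# Placeholder weights that sum to 1; NOT values reported in the post.
weights = np.array([0.5, 0.05, 0.2, 0.05, 0.2])

totals = mass_by_kind(weights, kinds)
for kind, mass in totals.items():
    print(f"{kind}: {mass:.2f}")
```

Comparing the `attn` and `mlp` totals for the value projection across layers would show whether the preference for MLP outputs holds consistently.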

10:28 PM · Jun 3, 2026 · 6.1K Views

Posts from X

catid (@MrCatid)

@YuchenL52766559 I would not interpret it as “producing them”, but rather as the model finding it more useful to have access to the original per-token embeddings when choosing the next token to emit. I have seen several works come to this conclusion.
