1/ I had some free time this weekend, so I worked through linear attention, from the delta rule up to Gated DeltaNet-2. It was kind of fun going back and forth with Claude about the different papers.
Users thank the researcher for the informative and nice blog post studying linear attention from the Delta Rule to Gated DeltaNet-2.
I started a new blog, so you can find the full write-up there - https://www.the-information-bottleneck.com/p/editing-a-compressed-memory
Tagging the expert @ahatamiz1
13/ The point is to manage a fixed-size memory better, not make it bigger, and none of it matches softmax's exact recall. That's why current models are hybrid: they use several linear blocks per full-attention block.

@ziv_ravid This is pretty informative. Thanks for the nice blog.
12/ The results are about what you'd expect: small gains on language modeling, larger ones on long-context retrieval with many keys, where interference is worst. In the ablation, most of the improvement comes from the erase gate, not the write gate.
2/ The whole line of work is about how to store memory in a single fixed-size matrix, and the hard part is editing that matrix without disturbing the rest.

@ziv_ravid Welcome to blogging :)
3/ We all know regular softmax attention: you keep all past keys and values, so recall is exact, but memory (needed to retrieve what happened at previous tokens) and compute both grow with length.

7/ You have three ways to handle that update (managing memory):
add: the old value stays, mixed into the new one
replace the matrix: the target is fixed, every other (previous) fact is gone
delta: only the target slot changes
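
The three options above can be compared on a toy memory. This is a minimal sketch under assumptions that are not in the thread: two orthonormal keys (`k_x`, `k_y`), scalar values, and a delta write at full strength. It stores x=5 and y=3, then writes x=7 with each rule.

```python
import numpy as np

# Hypothetical toy setup: two orthonormal keys, one-dimensional values.
k_x = np.array([1.0, 0.0, 0.0, 0.0])
k_y = np.array([0.0, 1.0, 0.0, 0.0])

def read(S, k):
    """Read the value stored under key k from a (d_k, d_v) memory."""
    return S.T @ k

# Memory already holds x=5 and y=3.
S0 = np.outer(k_x, [5.0]) + np.outer(k_y, [3.0])
new_v = np.array([7.0])

# add: the old value stays in the slot next to the new one.
S_add = S0 + np.outer(k_x, new_v)          # x reads 12, y reads 3

# replace the matrix: the target is right, everything else is gone.
S_replace = np.outer(k_x, new_v)           # x reads 7, y reads 0

# delta: subtract what the key currently returns, then write the target.
S_delta = S0 + np.outer(k_x, new_v - read(S0, k_x))  # x reads 7, y reads 3

for name, S in [("add", S_add), ("replace", S_replace), ("delta", S_delta)]:
    print(name, read(S, k_x), read(S, k_y))
```

With unnormalized accumulation the "add" slot holds the sum 5 + 7; any normalization of the read would turn that into a mix of the two values.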

9/ The erase factor makes each step depend on the previous one, so the naive version is sequential and slow on a GPU. DeltaNet runs it in chunks: within a chunk, it's matrix multiplication and a triangular solve, and only the state is carried over between chunks.
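
To make the chunking idea concrete, here is a rough NumPy sketch, not the actual DeltaNet kernel. It assumes the state has shape (d_k, d_v) and the update is the delta rule S ← (I − β kkᵀ)S + β kvᵀ. Inside a chunk, the per-step corrections `U` satisfy a unit lower-triangular system, so the chunk needs only matrix products and one solve. The function names and the use of `np.linalg.solve` (a general solver standing in for a triangular one) are my own choices.

```python
import numpy as np

def delta_sequential(S0, K, V, beta):
    """Reference: apply the delta rule one token at a time."""
    S = S0.copy()
    for k, v, b in zip(K, V, beta):
        S = S + b * np.outer(k, v - S.T @ k)
    return S

def delta_chunk(S0, K, V, beta):
    """One chunk: K is (C, d_k), V is (C, d_v), beta is (C,)."""
    C = K.shape[0]
    # Strictly lower-triangular key overlaps: step t sees only steps i < t.
    L = np.tril(K @ K.T, k=-1)
    A = np.eye(C) + beta[:, None] * L        # unit lower-triangular
    rhs = beta[:, None] * (V - K @ S0)       # error against the incoming state
    U = np.linalg.solve(A, rhs)              # a triangular solve in practice
    return S0 + K.T @ U                      # only this state leaves the chunk

def delta_chunked(S0, K, V, beta, chunk=4):
    """Carry the state across chunks; everything inside a chunk is matmuls."""
    S = S0
    for s in range(0, len(K), chunk):
        S = delta_chunk(S, K[s:s + chunk], V[s:s + chunk], beta[s:s + chunk])
    return S

rng = np.random.default_rng(0)
T, d_k, d_v = 10, 6, 3
K = rng.standard_normal((T, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.standard_normal((T, d_v))
beta = rng.uniform(0.1, 1.0, size=T)
S0 = np.zeros((d_k, d_v))

matches = np.allclose(delta_sequential(S0, K, V, beta),
                      delta_chunked(S0, K, V, beta, chunk=4))
print(matches)
```

The sequential loop and the chunked version should agree up to floating-point error; the chunked one just replaces the per-token dependency with a single solve per chunk.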

7/ The delta rule (Widrow-Hoff 1960; DeltaNet 2024) reads what the key currently returns, then writes the difference toward the target:
S ← (I − β kkᵀ)S + β kvᵀ
When we have a new key, the read is ~0, and it acts as a plain add. When we have a key that is similar to the previous one, the old value is subtracted and replaced.
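
A small check of the "read, then write the difference" reading of that formula. This is a sketch assuming S has shape (d_k, d_v) and a unit-norm key; the function names are mine.

```python
import numpy as np

def delta_update(S, k, v, beta):
    """Matrix form: S <- (I - beta k k^T) S + beta k v^T."""
    d_k = S.shape[0]
    return (np.eye(d_k) - beta * np.outer(k, k)) @ S + beta * np.outer(k, v)

def delta_update_read_write(S, k, v, beta):
    """Same update: read what k returns, write the difference toward v."""
    current = S.T @ k
    return S + beta * np.outer(k, v - current)

rng = np.random.default_rng(0)
d_k, d_v = 8, 4
S = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)          # unit-norm key (assumption)
v = rng.standard_normal(d_v)

# The two forms are algebraically identical.
same = np.allclose(delta_update(S, k, v, 0.7),
                   delta_update_read_write(S, k, v, 0.7))

# On an empty memory the read is 0, so the update is a plain add.
empty = np.zeros((d_k, d_v))
plain_add = np.allclose(delta_update(empty, k, v, 1.0), np.outer(k, v))

# With beta = 1 and a unit key, reading k afterwards returns exactly v.
overwritten = np.allclose(delta_update(S, k, v, 1.0).T @ k, v)

print(same, plain_add, overwritten)
```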

8/ β is the write strength, learned per token. At 1 it fully overwrites. Below 1 it writes more gently, which matters because keys aren't orthogonal, so a hard write along one key also perturbs the keys that overlap it.
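
To see why a gentler write can help, here is a sketch with two overlapping unit keys (the overlap of 0.6 and the stored values are made up for illustration). It overwrites the value under `k_a` and measures how much the read under `k_b` moves.

```python
import numpy as np

def delta_update(S, k, v, beta):
    """Delta rule in read-then-write form, S of shape (d_k, d_v)."""
    return S + beta * np.outer(k, v - S.T @ k)

# Two unit-norm keys that are not orthogonal: k_a . k_b = 0.6.
k_a = np.array([1.0, 0.0])
k_b = np.array([0.6, 0.8])

# Memory built by plain accumulation of two facts.
S = np.outer(k_a, [5.0]) + np.outer(k_b, [3.0])
new_v = np.array([9.0])

def disturbance(beta):
    """How far the read under k_b moves when we write under k_a."""
    S_new = delta_update(S, k_a, new_v, beta)
    return float(np.linalg.norm(S_new.T @ k_b - S.T @ k_b))

def progress(beta):
    """How far the read under k_a moves toward the new target."""
    S_new = delta_update(S, k_a, new_v, beta)
    return float(np.linalg.norm(S_new.T @ k_a - S.T @ k_a))

hard, gentle = disturbance(1.0), disturbance(0.5)
print(hard, gentle, progress(1.0), progress(0.5))
```

In this setup the side effect on `k_b` scales linearly with β, and so does the progress on `k_a`, which is the trade-off a learned per-token β gets to balance.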

10/ The rest are refinements that keep that chunked solve working:
Gated DeltaNet: a scalar decay before each write
KDA (Kimi Linear): a per-channel decay
Gated DeltaNet-2: separate erase and write gates
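
A rough sketch of the first two refinements, based on my reading of "decay, then delta write"; the exact parameterization in the papers may differ, and all function names here are hypothetical. The split erase/write gates of Gated DeltaNet-2 are left out because the thread does not spell out the formula.

```python
import numpy as np

def delta_step(S, k, v, beta):
    """Plain delta rule, S of shape (d_k, d_v)."""
    return S + beta * np.outer(k, v - S.T @ k)

def scalar_decay_step(S, k, v, beta, alpha):
    """Gated DeltaNet style (assumed form): one scalar alpha in (0, 1]
    shrinks the whole state, then the delta write is applied."""
    return delta_step(alpha * S, k, v, beta)

def channel_decay_step(S, k, v, beta, alpha_vec):
    """KDA style (assumed form): one decay per key channel, shape (d_k,),
    applied row-wise before the delta write."""
    return delta_step(alpha_vec[:, None] * S, k, v, beta)

rng = np.random.default_rng(0)
d_k, d_v = 6, 3
S = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)
v = rng.standard_normal(d_v)

# With no decay, both variants reduce to the plain delta rule.
no_decay_scalar = np.allclose(scalar_decay_step(S, k, v, 0.8, 1.0),
                              delta_step(S, k, v, 0.8))
no_decay_channel = np.allclose(channel_decay_step(S, k, v, 0.8, np.ones(d_k)),
                               delta_step(S, k, v, 0.8))

# A constant per-channel decay is the same as the scalar decay.
constant_matches = np.allclose(
    channel_decay_step(S, k, v, 0.8, np.full(d_k, 0.9)),
    scalar_decay_step(S, k, v, 0.8, 0.9))

print(no_decay_scalar, no_decay_channel, constant_matches)
```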


11/ GDN-2 splits the write strength into two channel-wise gates: one for erasing (key side), one for writing (value side). The forward pass is the same as KDA. The backward pass needs a new kernel, because the gate sits inside a sum over channels and can't be pulled out.
3/ Linear attention drops that and keeps one matrix, S = Σ kᵢvᵢᵀ, which is the same size at any length. The price of a fixed size is interference. Read a stored key and you get its value back, plus a bit of every other stored value.
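
A minimal NumPy sketch of that store and read, assuming random unit-norm keys (the dimensions are arbitrary). It checks that the read splits exactly into the wanted value plus a small contribution from every other stored value.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, n = 16, 8, 32   # key dim, value dim, number of facts (illustrative)

# Random unit-norm keys: nearly, but not exactly, orthogonal.
K = rng.standard_normal((n, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.standard_normal((n, d_v))

# S = sum_i k_i v_i^T: a (d_k, d_v) matrix, the same size for any n.
S = K.T @ V

# Read fact 0 by querying with its own key.
read = S.T @ K[0]

# The read is the wanted value plus a bit of every other stored value.
wanted = (K[0] @ K[0]) * V[0]
leak = sum((K[i] @ K[0]) * V[i] for i in range(1, n))

print(S.shape, np.allclose(read, wanted + leak), np.linalg.norm(leak))
```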
6/ The case that breaks plain accumulation is a key that comes back with a different value. They call it memory. A passage says x=5, then later x=7. Both give nearly the same key. If you just add, the slot ends up holding both and the read averages them.
5/ Softmax avoids this: the exponential sharpens the scores, so mismatched keys contribute almost nothing. But exp(q·k) can't be split into a query part times a key part, which is what you'd need to keep a fixed summary. You get clean reads or a fixed state, not both.
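
A side-by-side sketch of the two reads on the same stored facts. The random unit-norm keys and the score `scale` of 50 are assumptions chosen to make the sharpening visible; they are not from the thread.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, n = 16, 8, 64
scale = 50.0   # assumed sharpness of the softmax scores

K = rng.standard_normal((n, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.standard_normal((n, d_v))
q = K[0]       # query for fact 0

# Softmax read: needs all n keys and values at read time.
scores = scale * (K @ q)
w = np.exp(scores - scores.max())   # subtract the max for numerical stability
w /= w.sum()
softmax_read = w @ V

# Linear read: only the fixed-size summary S is kept.
S = K.T @ V
linear_read = S.T @ q

softmax_err = float(np.linalg.norm(softmax_read - V[0]))
linear_err = float(np.linalg.norm(linear_read - V[0]))
print(softmax_err, linear_err)
```

The softmax path keeps `K` and `V` around (memory grows with `n`), while the linear path keeps only `S` and pays for it in read error.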
4/ The wanted part stays the same size, but the leakage is a sum over everything else, so it grows with the number of stored facts.
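
A quick way to see that growth, again assuming random unit-norm keys and averaging over a few random draws (the helper name and sizes are made up for illustration):

```python
import numpy as np

def mean_leak_norm(n, d_k=32, d_v=8, trials=20, seed=0):
    """Average size of the leakage when reading fact 0 out of n stored facts."""
    rng = np.random.default_rng(seed)
    norms = []
    for _ in range(trials):
        K = rng.standard_normal((n, d_k))
        K /= np.linalg.norm(K, axis=1, keepdims=True)
        V = rng.standard_normal((n, d_v))
        S = K.T @ V
        # Unit key, so the wanted part of the read is exactly V[0].
        leak = S.T @ K[0] - V[0]
        norms.append(np.linalg.norm(leak))
    return float(np.mean(norms))

# Same state size every time; only the number of stored facts changes.
leaks = {n: mean_leak_norm(n) for n in (4, 16, 64, 256)}
for n, value in leaks.items():
    print(n, round(value, 2))
```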

@ahatamiz1 Thanks! Please let me know if you have comments/errors

@SunnySanyal9 Thanks!