Hybrid Models Pair Linear Blocks With Full Attention For Fixed Memory

Original post

1/ I had some free time this weekend, so I worked through linear attention, from the delta rule up to Gated DeltaNet-2. It was kind of fun back and forth with Claude about different papers.

3:18 PM · Jun 29, 2026 · 158 Views

Sentiment

Users thank the researcher for the informative and nice blog post studying linear attention from the Delta Rule to Gated DeltaNet-2.

Pos: 100.0%, Neg: 0.0%

4 comments with sentiment.

Related links

Editing a Compressed Memory

Via the-information-bottleneck.com

Posts from X

Ravid Shwartz Ziv@ziv_ravid

I started a new blog, so you can find the full write-up there - https://www.the-information-bottleneck.com/p/editing-a-compressed-memory

Tagging the expert @ahatamiz1

Ravid Shwartz Ziv@ziv_ravid

13/ The point is to manage a fixed-size memory better, not make it bigger, and none of it matches softmax's exact recall. That's why current models are hybrid: they use several linear blocks per full-attention block.

Ali Hatamizadeh@ahatamiz1

@ziv_ravid This is pretty informative. Thanks for the nice blog.

Ravid Shwartz Ziv@ziv_ravid

12/ The results are about what you'd expect: small gains on language modeling, larger ones on long-context retrieval with many keys, where interference is worst. In the ablation, most of the improvement comes from the erase gate, not the write gate.

Ravid Shwartz Ziv@ziv_ravid

2/ The whole line of work is about how to store memory in a single fixed-size matrix, and the hard part is editing that matrix without disturbing the rest.

Sunny Sanyal@SunnySanyal9

@ziv_ravid Welcome to blogging :)

Ravid Shwartz Ziv@ziv_ravid

3/ We all know the regular softmax attention, where you keep all past keys and values, so recall is exact, but memory (to retrieve what happened in previous tokens) and compute grow with length.

Ravid Shwartz Ziv@ziv_ravid

7/ You have three ways to handle that update (managing memory):

add: the old value stays, mixed into the new one

replace the matrix: the target is fixed, every other (previous) fact is gone

delta: only the target slot changes
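The three options can be compared on the thread's own x=5 then x=7 case. This is a minimal NumPy sketch, not code from the blog post: the one-hot keys, the unrelated fact y=3, and the unnormalized read `S.T @ k` are assumptions made for illustration.

```python
import numpy as np

d_k = 4
k = np.zeros(d_k); k[0] = 1.0               # key for "x" (unit norm, assumed one-hot)
k_other = np.zeros(d_k); k_other[1] = 1.0   # key for an unrelated fact "y"
v_old, v_new, v_y = np.array([5.0]), np.array([7.0]), np.array([3.0])

# Memory after storing x=5 and y=3.
S0 = np.outer(k, v_old) + np.outer(k_other, v_y)

# add: write on top of what is already stored under the key.
S_add = S0 + np.outer(k, v_new)
# replace the matrix: keep only the new fact, dropping everything else.
S_replace = np.outer(k, v_new)
# delta: subtract what the key currently returns, then write the target.
S_delta = S0 + np.outer(k, v_new - S0.T @ k)

for name, S in [("add", S_add), ("replace", S_replace), ("delta", S_delta)]:
    print(name, "x ->", S.T @ k, "y ->", S.T @ k_other)
```

With an unnormalized read, add returns the sum of the old and new values for x, replace returns the new x but nothing for y, and delta returns the new x while leaving y as it was.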

Ravid Shwartz Ziv@ziv_ravid

9/ The erase factor makes each step depend on the previous one, so the naive version is sequential and slow on a GPU. DeltaNet runs it in chunks: within a chunk, it's matrix multiplication and a triangular solve, and only the state is carried over between chunks.
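One way to see why a chunk reduces to matrix multiplications plus a triangular solve is to write each step as S ← S + k uᵀ with u = β(v − Sᵀk), then solve for all the u vectors of a chunk at once. The sketch below is an assumption-laden illustration, not DeltaNet's actual kernel: the function names are hypothetical, and it uses the general-purpose `np.linalg.solve` where a real implementation would exploit the triangular structure.

```python
import numpy as np

def delta_sequential(S0, K, V, beta):
    """Reference: apply the delta rule one token at a time."""
    S = S0.copy()
    for k, v, b in zip(K, V, beta):
        S = S + b * np.outer(k, v - S.T @ k)
    return S

def delta_chunk(S0, K, V, beta):
    """Same result for one chunk via matmuls and one unit lower-triangular solve."""
    C = K.shape[0]
    # A[t, j] = beta_t * (k_t . k_j) for j < t: how earlier writes affect token t's read.
    A = np.tril(beta[:, None] * (K @ K.T), k=-1)
    # Right-hand side: what each token would write if only the carried-in state existed.
    rhs = beta[:, None] * (V - K @ S0)
    U = np.linalg.solve(np.eye(C) + A, rhs)   # rows of U are the per-token writes u_t
    return S0 + K.T @ U

def delta_chunked(S0, K, V, beta, chunk=4):
    """Process the sequence chunk by chunk, carrying only the state S between chunks."""
    S = S0
    for s in range(0, K.shape[0], chunk):
        S = delta_chunk(S, K[s:s + chunk], V[s:s + chunk], beta[s:s + chunk])
    return S

rng = np.random.default_rng(0)
T, d_k, d_v = 10, 6, 3
K = rng.standard_normal((T, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.standard_normal((T, d_v))
beta = rng.uniform(0.1, 1.0, size=T)
S0 = np.zeros((d_k, d_v))

print(np.allclose(delta_sequential(S0, K, V, beta), delta_chunked(S0, K, V, beta)))
```

The sequential loop has T dependent steps, while the chunked version has only T / chunk dependent steps, each made of dense operations that map well to a GPU.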

Ravid Shwartz Ziv@ziv_ravid

7/ The delta rule (Widrow-Hoff 1960; DeltaNet 2024) reads what the key currently returns, then writes the difference toward the target:

S ← (I − β kkᵀ)S + β kvᵀ

When we have a new key, the read is ~0, and it acts as a plain add. When we have a key that is similar to the previous one, the old value is subtracted and replaced.
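As a quick check of the formula above, the matrix form and the "read, then write the difference" form can be compared numerically. This is a small sketch under assumptions: S has shape (d_k, d_v), the read for key k is Sᵀk, the key is unit-norm, and the function names are made up for the example.

```python
import numpy as np

def delta_update(S, k, v, beta):
    """Matrix form: S <- (I - beta k k^T) S + beta k v^T."""
    I = np.eye(S.shape[0])
    return (I - beta * np.outer(k, k)) @ S + beta * np.outer(k, v)

def read_then_write(S, k, v, beta):
    """Equivalent form: read the current value, write the difference toward v."""
    current = S.T @ k
    return S + beta * np.outer(k, v - current)

rng = np.random.default_rng(0)
S = rng.standard_normal((4, 2))
k = rng.standard_normal(4)
k /= np.linalg.norm(k)          # unit-norm key (assumption)
v = rng.standard_normal(2)
beta = 0.5

old_read = S.T @ k
S_new = delta_update(S, k, v, beta)

# Both forms give the same state.
print(np.allclose(S_new, read_then_write(S, k, v, beta)))
# For a unit key, the new read is a blend of the old read and the target.
print(np.allclose(S_new.T @ k, (1 - beta) * old_read + beta * v))
```

When the old read is close to zero (a new key), the blend reduces to β v, which is the plain add case; at β = 1 the old value is removed entirely.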

Ravid Shwartz Ziv@ziv_ravid

8/ β is the write strength, learned per token. At 1 it fully overwrites. Below 1 it writes more gently, which matters because keys aren't orthogonal, so a hard write along one key also perturbs the keys that overlap it.
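The trade-off can be seen with two overlapping keys. In this hypothetical example (the keys, values, and overlap of 0.6 are chosen by hand, not taken from the post), a value is stored under k2, then a different value is written under k1 with two settings of β.

```python
import numpy as np

def delta_update(S, k, v, beta):
    # Delta rule in read-then-write form, with S of shape (d_k, d_v).
    return S + beta * np.outer(k, v - S.T @ k)

k1 = np.array([1.0, 0.0])
k2 = np.array([0.6, 0.8])            # unit norm, overlap k1 . k2 = 0.6
v1, v2 = np.array([1.0]), np.array([10.0])

# Store v2 under k2 in an empty memory.
S = delta_update(np.zeros((2, 1)), k2, v2, beta=1.0)

def write_k1(beta):
    """Write v1 under k1 with strength beta; return the reads for k1 and k2."""
    S_new = delta_update(S, k1, v1, beta)
    return (S_new.T @ k1).item(), (S_new.T @ k2).item()

for beta in (1.0, 0.5):
    r1, r2 = write_k1(beta)
    print(f"beta={beta}: read(k1)={r1:.2f} (target 1), read(k2)={r2:.2f} (was 10)")
```

In this setup the read for k2 after the write is 10 − 3β, so the hard write (β = 1) hits the k1 target exactly but shifts the k2 read the most, while β = 0.5 leaves k1 short of its target and disturbs k2 less.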

Ravid Shwartz Ziv@ziv_ravid

10/ The rest are refinements that keep that chunked solve working:

Gated DeltaNet: a scalar decay before each write

KDA (Kimi Linear): a per-channel decay

Gated DeltaNet-2: separate erase and write gates
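The first two refinements can be sketched as a decay applied to the state before the delta write. This is a rough illustration under assumptions, not the papers' code: the function name is hypothetical, the decay is assumed to lie in (0, 1], and the per-channel version is assumed to act on the key dimension of S. Gated DeltaNet-2's separate gates are not included because the thread does not give their exact form.

```python
import numpy as np

def gated_delta_update(S, k, v, beta, decay):
    """Decay the state, then apply the delta rule (illustrative sketch).

    decay: a scalar (scalar-decay sketch) or a vector of shape (d_k,)
    with one decay per key channel (per-channel sketch).
    """
    decay = np.asarray(decay, dtype=float)
    # A vector decay scales each row (key channel) of S separately.
    S_decayed = decay[:, None] * S if decay.ndim == 1 else decay * S
    return S_decayed + beta * np.outer(k, v - S_decayed.T @ k)

# Three facts stored under one-hot keys e0, e1, e2.
S = np.array([[5.0], [3.0], [2.0]])
k = np.array([1.0, 0.0, 0.0])
v = np.array([7.0])

plain = gated_delta_update(S, k, v, beta=1.0, decay=1.0)               # no decay
scalar = gated_delta_update(S, k, v, beta=1.0, decay=0.5)              # all channels fade
channel = gated_delta_update(S, k, v, beta=1.0, decay=[1.0, 0.5, 1.0]) # only channel 1 fades

print(plain.ravel(), scalar.ravel(), channel.ravel())
```

With a decay of 1 this reduces to the plain delta rule; the scalar decay shrinks every stored fact that is not rewritten, while the per-channel decay can forget one channel and keep the others.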

Ravid Shwartz Ziv@ziv_ravid

11/ GDN-2 splits the write strength into two channel-wise gates: one for erasing (key side), one for writing (value side). The forward pass is the same as KDA. The backward pass needs a new kernel, because the gate sits inside a sum over channels and can't be pulled out.

Ravid Shwartz Ziv@ziv_ravid

3/ Linear attention drops that and keeps one matrix, S = Σ kᵢvᵢᵀ, which is the same size at any length. The price of a fixed size is interference. Read a stored key and you get its value back, plus a bit of every other stored value.
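A small NumPy sketch makes the interference term concrete. The sizes, the random unit-norm keys, and the read `S.T @ k` are assumptions chosen for illustration; they are not taken from the blog post.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, n = 64, 8, 32          # hypothetical sizes

# n random unit-norm keys (rows) and their values.
K = rng.standard_normal((n, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.standard_normal((n, d_v))

# S = sum_i k_i v_i^T: a (d_k, d_v) matrix whose size does not depend on n.
S = K.T @ V

# Read slot 0 by querying with its own key.
read = S.T @ K[0]

# Split the read into the wanted value and the leakage from every other pair.
wanted = V[0] * (K[0] @ K[0])        # (k_0 . k_0) v_0, which is v_0 for a unit key
leakage = (K[1:] @ K[0]) @ V[1:]     # sum over i != 0 of (k_i . k_0) v_i

print(S.shape)
print(np.allclose(read, wanted + leakage))
```

The leakage vanishes only if the other keys are orthogonal to the queried one, which stops being possible once more than d_k keys are stored.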

Ravid Shwartz Ziv@ziv_ravid

6/ The case that breaks plain accumulation is a key that comes back with a different value. They call it memory. A passage says x=5, then later x=7. Both give nearly the same key. If you just add, the slot ends up holding both and the read averages them.

Ravid Shwartz Ziv@ziv_ravid

5/ Softmax avoids this: the exponential sharpens the scores, so mismatched keys contribute almost nothing. But exp(q·k) can't be split into a query part times a key part, which is what you'd need to keep a fixed summary. You get clean reads or a fixed state, not both.
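The sharpening effect can be shown with a few hand-picked numbers. In this sketch the overlaps, values, and score scale are all hypothetical, chosen only to make the contrast visible.

```python
import numpy as np

# Hypothetical query-key overlaps q . k_i: slot 0 matches, the rest are mismatched.
overlaps = np.array([1.0, 0.3, 0.3, 0.3, 0.3])
values = np.array([5.0, 1.0, 2.0, 3.0, 4.0])
scale = 8.0   # assumed score scale, not a value from the post

# Linear read: every mismatched key leaks in proportion to its overlap.
linear_read = overlaps @ values

# Softmax read: exponentiate the scores, normalize, then average the values.
weights = np.exp(scale * overlaps)
weights /= weights.sum()
softmax_read = weights @ values

print(f"linear read:  {linear_read:.3f}")
print(f"softmax read: {softmax_read:.3f}, weight on the match: {weights[0]:.3f}")
```

The softmax read lands close to the matching value of 5, but computing the weights requires every q·kᵢ at read time, so all keys have to be kept. The linear read needs only the summed matrix and pays for that with the leaked terms.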

Ravid Shwartz Ziv@ziv_ravid

4/ The wanted part stays the same size, but the leakage is a sum over everything else, so it grows with the number of stored facts.
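To see the growth, one can measure the read error for a single slot as more random facts are stored. This is a sketch under assumptions (random unit-norm keys, Gaussian values, a hypothetical `leakage_norm` helper), so the exact numbers depend on the seed.

```python
import numpy as np

def leakage_norm(n, d_k=64, d_v=8, seed=0):
    """Norm of (read for slot 0) minus (stored value for slot 0) with n facts stored."""
    rng = np.random.default_rng(seed)
    K = rng.standard_normal((n, d_k))
    K /= np.linalg.norm(K, axis=1, keepdims=True)
    V = rng.standard_normal((n, d_v))
    S = K.T @ V                       # fixed-size memory, regardless of n
    return float(np.linalg.norm(S.T @ K[0] - V[0]))

for n in (1, 4, 32, 256):
    print(n, round(leakage_norm(n), 3))
```

With a single stored fact the error is zero up to rounding, and it rises as n grows while the state stays at d_k × d_v entries.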

Ravid Shwartz Ziv@ziv_ravid

@ahatamiz1 Thanks! Please let me know if you have comments/errors

Ravid Shwartz Ziv@ziv_ravid

@SunnySanyal9 Thanks!
