/Tech4h ago

DeltaNet Parallelizes Chunk Training Using Matrix Ops And Triangular Solves

300015

Original post

9/ The erase factor makes each step depend on the previous one, so the naive version is sequential and slow on a GPU. DeltaNet runs it in chunks: within a chunk, it's matrix multiplication and a triangular solve, and only the state is carried over between chunks.

Ravid Shwartz Ziv@ziv_ravid

8/ β is the write strength, learned per token. At 1 it fully overwrites. Below 1 it writes more gently, which matters because keys aren't orthogonal, so a hard write along one key also perturbs the keys that overlap it.

3:18 PM · Jun 29, 2026 · 6 Views

Sentiment

Sentiment building, check back later.

Cluster Engagement

Digg Deeper

No Digg Deeper questions have been answered for this story yet.

Posts from X

Most Activity

VIEWS139BOOKMARKS2LIKES3RETWEETS1

Ravid Shwartz Ziv@ziv_ravid

I started a new blog, so you can find the full write-up there - https://www.the-information-bottleneck.com/p/editing-a-compressed-memory

Tagging the expert @ahatamiz1

4h13932

REPLIES1

Ravid Shwartz Ziv@ziv_ravid

10/ The rest are refinements that keep that chunked solve working:

Gated DeltaNet: a scalar decay before each write

KDA (Kimi Linear): a per-channel decay

Gated DeltaNet-2: separate erase and write gates

Ravid Shwartz Ziv@ziv_ravid

4h500

Ravid Shwartz Ziv@ziv_ravid

13/ The point is to manage a fixed-size memory better, not make it bigger, and none of it matches softmax's exact recall. That's why current models are hybrid: they use several linear blocks per full-attention block.

4h1011

Ravid Shwartz Ziv@ziv_ravid

12/ The results are about what you'd expect: small gains on language modeling, larger ones on long-context retrieval with many keys, where interference is worst. In the ablation, most of the improvement comes from the erase gate, not the write gate.

4h22

Ravid Shwartz Ziv@ziv_ravid

11/ GDN-2 splits the write strength into two channel-wise gates: one for erasing (key side), one for writing (value side). The forward pass is the same as KDA. The backward pass needs a new kernel, because the gate sits inside a sum over channels and can't be pulled out.

Ravid Shwartz Ziv@ziv_ravid

10/ The rest are refinements that keep that chunked solve working:

Gated DeltaNet: a scalar decay before each write

KDA (Kimi Linear): a per-channel decay

Gated DeltaNet-2: separate erase and write gates

4h400