Gated DeltaNet-2 separates channel-wise erase and write gates within linear attention, raising S-NIAH-3 scores from 63 to 90 on 1.3B models trained on 100B tokens


Gated DeltaNet has been one of my favorite "hybrid attention" newcomers in the good old transformer stack. Excited to see Gated DeltaNet-2. Adding it to my reading stack. In the meantime, I have a primer on Gated DeltaNet here: https://magazine.sebastianraschka.com/i/177848019/26-gated-deltanet

Ali Hatamizadeh@ahatamiz1

Gated DeltaNet-2 is here. 🚀

🔥 New paper: Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Gated DeltaNet-2 outperforms KDA and Mamba-3, the latest and best recurrent architectures, head to head at 1.3B. 🏆

💡 Here's the idea behind it:

Linear attention squeezes an unbounded KV cache into a fixed-size recurrent state. The hard part isn't just what to forget, it's how to edit that memory without scrambling the associations already in it.
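A minimal sketch of that trade-off, assuming the plain (ungated) linear-attention write S <- S + v k^T on a state of shape (d_v, d_k). The function names and the toy keys and values are made up for illustration and are not taken from the paper or its code.

```python
# Hypothetical illustration: plain additive linear attention, no gates.
def linear_attention_step(state, k, v):
    """Additive write S <- S + v k^T; the state keeps its (d_v, d_k) shape."""
    return [[s + vi * kj for s, kj in zip(row, k)] for row, vi in zip(state, v)]

def read(state, q):
    """Read o = S q."""
    return [sum(s * qj for s, qj in zip(row, q)) for row in state]

d_k, d_v = 4, 3
state = [[0.0] * d_k for _ in range(d_v)]
kv_cache = []  # what softmax attention would have to keep around

for t in range(1000):
    k = [1.0 if j == t % d_k else 0.0 for j in range(d_k)]  # toy one-hot key
    v = [float(t)] * d_v                                    # toy value
    state = linear_attention_step(state, k, v)
    kv_cache.append((k, v))

state_size = d_v * d_k                   # constant, independent of sequence length
cache_size = len(kv_cache) * (d_k + d_v) # grows linearly with sequence length

# Reading with the first one-hot key returns the sum of every value ever
# written under that key, not the most recent one.
readout = read(state, [1.0, 0.0, 0.0, 0.0])
```

With purely additive writes, every association stored under a key is superimposed on the older ones, which is the "scrambling" that delta-rule updates try to avoid by removing old content before writing.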

Prior delta-rule models like Gated DeltaNet and KDA use one scalar gate to do two jobs at once: erasing old content and writing new content. But these two decisions act on different axes of the state, so tying them together is a real limitation.

Gated DeltaNet-2 decouples them.

✂️ a channel-wise erase gate b_t picks which key-side coordinates to read and remove
✍️ a channel-wise write gate w_t picks which value-side coordinates to commit
🔁 recovers KDA when both gates collapse to a scalar, and Gated DeltaNet when the decay collapses too
⚡ still trains fast: chunkwise WY algorithm with gate-aware backward, fused in Triton
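One way to picture the decoupling, as a rough sketch only. The update form below is an assumption chosen so that it reduces to a scalar-gate delta rule when both gates are tied; it is not the paper's equation, and `decoupled_delta_step`, `scalar_gate_step`, and the toy numbers are hypothetical. The state S has shape (d_v, d_k), `decay` and `erase` act on key coordinates, and `write` acts on value coordinates.

```python
# Hypothetical sketch of a decoupled erase/write step; NOT the paper's formula.
def matvec(S, x):
    return [sum(s * xj for s, xj in zip(row, x)) for row in S]

def decoupled_delta_step(S, k, v, decay, erase, write):
    """Assumed form: S <- S Diag(decay) - (S Diag(decay) (erase*k)) k^T + (write*v) k^T."""
    # Channel-wise decay over key coordinates.
    S = [[s * a for s, a in zip(row, decay)] for row in S]
    # Read the old content along the erase-gated key.
    old = matvec(S, [b * kj for b, kj in zip(erase, k)])
    # Remove that content, then commit the write-gated value along k.
    return [[s - o * kj + w * vi * kj for s, kj in zip(row, k)]
            for row, o, w, vi in zip(S, old, write, v)]

def scalar_gate_step(S, k, v, decay, beta):
    """Tied-gate reference: S <- S Diag(decay) (I - beta k k^T) + beta v k^T."""
    S = [[s * a for s, a in zip(row, decay)] for row in S]
    old = matvec(S, k)
    return [[s + beta * (vi - o) * kj for s, kj in zip(row, k)]
            for row, o, vi in zip(S, old, v)]

S0 = [[0.5, -1.0], [2.0, 0.25], [1.0, 1.0]]  # d_v = 3, d_k = 2
k, v = [0.6, 0.8], [1.0, 2.0, 3.0]
decay, beta = [0.9, 0.5], 0.3

# Tying both gates to one scalar should match the scalar-gate step.
tied = decoupled_delta_step(S0, k, v, decay, [beta] * 2, [beta] * 3)
ref = scalar_gate_step(S0, k, v, decay, beta)
max_diff = max(abs(a - b) for ra, rb in zip(tied, ref) for a, b in zip(ra, rb))

# Erase without writing: full erase gate, zero write gate, no decay.
erased = decoupled_delta_step(S0, k, v, [1.0, 1.0], [1.0, 1.0], [0.0] * 3)
residual = matvec(erased, k)  # content still readable along k
```

In this sketch the two decisions really are independent: the last call clears what was stored along `k` while committing nothing new, which a single shared scalar gate cannot express.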

📊 Results:

We train 1.3B models on 100B tokens of FineWeb-Edu, matched in recurrent state size, against Mamba-2, Gated DeltaNet, KDA, and Mamba-3.

Best average on language modeling + commonsense reasoning, in both recurrent and hybrid settings
Biggest gains on long-context RULER retrieval. S-NIAH-3 jumps from 63 to 90 over KDA, and multi-key needle retrieval climbs from 28 to 38

Joint work with @YejinChoinka and @jankautz.

📄 Paper: https://shorturl.at/AAlVb
💻 Code: https://github.com/NVlabs/GatedDeltaNet-2

#LinearAttention #StateSpaceModels #Mamba #LLM


elie@eliebakouch

gated deltanet 2 compared to previous linear attention methods (kimi delta attention, gated deltanet, mamba2)

each new variant adds finer control over what to decay, erase, and write in the state matrix


BlinkDL@BlinkDL_AI

Gated DeltaNet-2 is almost exactly RWKV-7's DPLR recurrence, not acknowledging the elephant in the room 🙂


Sebastian Raschka@rasbt

PS: it can be found in recent Qwen models since Qwen3-Next


Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

> Biggest gains on long-context RULER retrieval. S-NIAH-3 jumps from 63 to 90 over KDA, and multi-key needle retrieval climbs from 28 to 38

Expected from this change. Maybe hybrids with GDN-2 layers are highly promising.


elie@eliebakouch

updated viz with RWKV-7 and longcat flash linear attention


Eric Alcaide@eric_alcaide

@ahatamiz1 This looks quite similar to rwkv7 doesn’t it 👀


Elliot Arledge@elliotarledge

@ahatamiz1 reading now


elie@eliebakouch

thanks @norxornor and @Grad62304977


Torsten Scholak@tscholak

@eliebakouch Ok, but is it still fast enough to make that extra flexibility worth it?


Mike Erlihson, Math PhD, AI@MikeE_3_14

@ahatamiz1 Isn't this very similar to what lstm does?


elie@eliebakouch

@norxornor here you go, not sure this is correct tho


nor@norxornor

@eliebakouch would be interesting to compare with rwkv-7, many architectural decisions across variants seem to be converging to it


Louis@Louis9687221579

@eliebakouch is this Claude? It seems like Claude likes to write a weird version of GDN2. This was taken directly from the appendix, which highlights how similar it is to RWKV-7 and FG2-GDN


nor@norxornor

nice! so the difference in gdn-2 seems to be that the keys are coupled, and the extra element-wise product of w and v is like an mlp (so unless that behavior is doing something special (but rwkv also uses value residual), it seems to be effectively a subset of rwkv if we absorb other stuff appropriately)


Oleksii Halahan 🇺🇦🇪🇺@raspbfox

@ahatamiz1 How does it compare to Rwkv?


Dante@thedntx

@ahatamiz1 linear attention finally getting a taste of erase-write separation, it's about time


Jeremy Howard@jeremyphoward

@ahatamiz1 Love it! Getting closer and closer to an LSTM ;)


Wario@atrasdoarwario

@ahatamiz1 I'd be interested in seeing how Mamba-3 would behave with expansion=1 (this only helps to save disk space), a larger d_state and a lower d_model. d_state at 64 seems quite low. But I could be wrong that this would fare better.


Aaryan Kakad@aaryan_kakad

@ahatamiz1 WHAT

crazy, need to study this asap
