Gated DeltaNet-2 separates channel-wise erase and write gates within linear attention, raising S-NIAH-3 scores from 63 to 90 on 1.3B models trained on 100B tokens

QUOTE POST

Gated DeltaNet has been one of my favorite "hybrid attention" newcomers in the good old transformer stack. Excited to see Gated DeltaNet-2. Adding it to my reading stack. In the meantime, I have a primer on Gated DeltaNet here: https://magazine.sebastianraschka.com/i/177848019/26-gated-deltanet

Ali Hatamizadeh@ahatamiz1

Gated DeltaNet-2 is here. 🚀 🔥 New paper: Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention Gated DeltaNet-2 outperforms KDA and Mamba-3, the latest and best recurrent architectures, head to head at 1.3B. 🏆 💡 Here's the idea behind it: Linear attention squeezes an unbounded KV cache into a fixed-size recurrent state. The hard part isn't just what to forget, it's how to edit that memory without scrambling the associations already in it. Prior delta-rule models like Gated DeltaNet and KDA use one scalar gate to do two jobs at once: erasing old content and writing new content. But these two decisions act on different axes of the state, so tying them together is a real limitation. Gated DeltaNet-2 decouples them. ✂️ a channel-wise erase gate b_t picks which key-side coordinates to read and remove ✍️ a channel-wise write gate w_t picks which value-side coordinates to commit 🔁 recovers KDA when both gates collapse to a scalar, and Gated DeltaNet when the decay collapses too ⚡ still trains fast: chunkwise WY algorithm with gate-aware backward, fused in Triton 📊 Results: We train 1.3B models on 100B tokens of FineWeb-Edu, matched in recurrent state size, against Mamba-2, Gated DeltaNet, KDA, and Mamba-3. Best average on language modeling + commonsense reasoning, in both recurrent and hybrid settings Biggest gains on long-context RULER retrieval. S-NIAH-3 jumps from 63 to 90 over KDA, and multi-key needle retrieval climbs from 28 to 38 Joint work with @YejinChoinka and @jankautz. 📄 Paper: https://shorturl.at/AAlVb 💻 Code: https://github.com/NVlabs/GatedDeltaNet-2 #LinearAttention #StateSpaceModels #Mamba #LLM

10:17 PM · May 21, 2026 · 46.8K Views

11:10 PM · May 21, 2026 · 20.9K Views

REPLY

#155Sebastian Raschka@RASBT

PS: it can be found in recent Qwen models since Qwen3-Next

Sebastian Raschka@rasbt

Gated DeltaNet has been one of my favorite "hybrid attention" newcomers in the good old transformer stack. Excited to see Gated DeltaNet-2. Adding it to my reading stack. In the meantime, I have a primer on Gated DeltaNet here: https://magazine.sebastianraschka.com/i/177848019/26-gated-deltanet

11:10 PM · May 21, 2026 · 20.9K Views

11:12 PM · May 21, 2026 · 3K Views

QUOTE POST

#420Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@TEORTAXESTEX

> Biggest gains on long-context RULER retrieval. S-NIAH-3 jumps from 63 to 90 over KDA, and multi-key needle retrieval climbs from 28 to 38 Expected from this change. Maybe hybrids with GDN-2 layers are highly promising.

Ali Hatamizadeh@ahatamiz1

Gated DeltaNet-2 is here. 🚀 🔥 New paper: Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention Gated DeltaNet-2 outperforms KDA and Mamba-3, the latest and best recurrent architectures, head to head at 1.3B. 🏆 💡 Here's the idea behind it: Linear attention squeezes an unbounded KV cache into a fixed-size recurrent state. The hard part isn't just what to forget, it's how to edit that memory without scrambling the associations already in it. Prior delta-rule models like Gated DeltaNet and KDA use one scalar gate to do two jobs at once: erasing old content and writing new content. But these two decisions act on different axes of the state, so tying them together is a real limitation. Gated DeltaNet-2 decouples them. ✂️ a channel-wise erase gate b_t picks which key-side coordinates to read and remove ✍️ a channel-wise write gate w_t picks which value-side coordinates to commit 🔁 recovers KDA when both gates collapse to a scalar, and Gated DeltaNet when the decay collapses too ⚡ still trains fast: chunkwise WY algorithm with gate-aware backward, fused in Triton 📊 Results: We train 1.3B models on 100B tokens of FineWeb-Edu, matched in recurrent state size, against Mamba-2, Gated DeltaNet, KDA, and Mamba-3. Best average on language modeling + commonsense reasoning, in both recurrent and hybrid settings Biggest gains on long-context RULER retrieval. S-NIAH-3 jumps from 63 to 90 over KDA, and multi-key needle retrieval climbs from 28 to 38 Joint work with @YejinChoinka and @jankautz. 📄 Paper: https://shorturl.at/AAlVb 💻 Code: https://github.com/NVlabs/GatedDeltaNet-2 #LinearAttention #StateSpaceModels #Mamba #LLM

10:17 PM · May 21, 2026 · 46.8K Views

10:45 PM · May 21, 2026 · 2.7K Views

QUOTE POST

#716elie@ELIEBAKOUCH

gated deltanet 2 compared to previous linear attention methods (kimi delta attention, gated deltanet, mamba2)

each new variant adds finer control over what to decay, erase, and write in the state matrix

Ali Hatamizadeh@ahatamiz1

Gated DeltaNet-2 is here. 🚀 🔥 New paper: Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention Gated DeltaNet-2 outperforms KDA and Mamba-3, the latest and best recurrent architectures, head to head at 1.3B. 🏆 💡 Here's the idea behind it: Linear attention squeezes an unbounded KV cache into a fixed-size recurrent state. The hard part isn't just what to forget, it's how to edit that memory without scrambling the associations already in it. Prior delta-rule models like Gated DeltaNet and KDA use one scalar gate to do two jobs at once: erasing old content and writing new content. But these two decisions act on different axes of the state, so tying them together is a real limitation. Gated DeltaNet-2 decouples them. ✂️ a channel-wise erase gate b_t picks which key-side coordinates to read and remove ✍️ a channel-wise write gate w_t picks which value-side coordinates to commit 🔁 recovers KDA when both gates collapse to a scalar, and Gated DeltaNet when the decay collapses too ⚡ still trains fast: chunkwise WY algorithm with gate-aware backward, fused in Triton 📊 Results: We train 1.3B models on 100B tokens of FineWeb-Edu, matched in recurrent state size, against Mamba-2, Gated DeltaNet, KDA, and Mamba-3. Best average on language modeling + commonsense reasoning, in both recurrent and hybrid settings Biggest gains on long-context RULER retrieval. S-NIAH-3 jumps from 63 to 90 over KDA, and multi-key needle retrieval climbs from 28 to 38 Joint work with @YejinChoinka and @jankautz. 📄 Paper: https://shorturl.at/AAlVb 💻 Code: https://github.com/NVlabs/GatedDeltaNet-2 #LinearAttention #StateSpaceModels #Mamba #LLM

10:17 PM · May 21, 2026 · 46.8K Views

11:49 PM · May 21, 2026 · 8K Views

REPLY

#716elie@ELIEBAKOUCH

updated viz with RWKV-7 and longcat flash linear attention

elie@eliebakouch

gated deltanet 2 compared to previous linear attention methods (kimi delta attention, gated deltanet, mamba2) each new variant adds finer control over what to decay, erase, and write in the state matrix

11:49 PM · May 21, 2026 · 8K Views

12:57 AM · May 22, 2026 · 488 Views

REPLY

#716elie@ELIEBAKOUCH

thanks @norxornor and @Grad62304977

elie@eliebakouch

updated viz with RWKV-7 and longcat flash linear attention

12:57 AM · May 22, 2026 · 488 Views

12:58 AM · May 22, 2026 · 361 Views

Gated DeltaNet-2 separates channel-wise erase and write gates within linear attention, raising S-NIAH-3 scores from 63 to 90 on 1.3B models trained on 100B tokens

Sentiment

Cluster engagement