As hybrid models (Qwen3.5 / Nemotron Ultra) run agents with massive context, Gated-DeltaNet / Mamba states become a bottleneck. A simple insight makes this up to ~2x faster: load the states, compute, but don't store them. This recompute trick finally unlocks spec decoding for SSMs.
More and more models are hybrids (Nemotron-3, Qwen3.5), so SSM decode speed matters. So why do we store the SSM state at all?
We only write the state back every step so the next step can read it. ReplaySSM caches the recent inputs instead and rebuilds the state on the fly.
Same outputs, half the memory traffic → ~2x on spec decode at large batch sizes (spec decode barely even helped SSMs before) → up to 1.43x standard decode on large hybrids (up to Nemotron-Ultra-550B)
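A toy sketch of the idea in NumPy, not the actual ReplaySSM kernel. Assumptions that are mine and not from the blog: a simplified gated linear recurrence `h_t = a_t * h_{t-1} + k_t v_t^T`, a hypothetical `window` after which the checkpoint is refreshed, and element counts as a rough proxy for memory traffic. All function names are made up for illustration.

```python
import numpy as np

def step(h, a, k, v):
    # Toy gated linear recurrence (assumed form): h_t = a_t * h_{t-1} + k_t v_t^T
    return a * h + np.outer(k, v)

def decode_standard(h0, inputs, queries):
    """Baseline: load the state, update it, store it back on every step."""
    h, outs, traffic = h0.copy(), [], 0
    for (a, k, v), q in zip(inputs, queries):
        h = step(h, a, k, v)
        traffic += 2 * h.size            # one state read + one state write
        outs.append(q @ h)
    return np.array(outs), traffic

def decode_replay(h0, inputs, queries, window=4):
    """Replay: keep a stale checkpoint plus recent inputs, rebuild the state on the fly."""
    ckpt, buf, outs, traffic = h0.copy(), [], [], 0
    for (a, k, v), q in zip(inputs, queries):
        buf.append((a, k, v))
        h = ckpt                         # one state read, no write back
        traffic += ckpt.size
        for ai, ki, vi in buf:           # replay the cached inputs
            h = step(h, ai, ki, vi)
        traffic += len(buf) * (1 + k.size + v.size)  # inputs are small next to the state
        outs.append(q @ h)
        if len(buf) == window:           # hypothetical policy: refresh the checkpoint occasionally
            ckpt, buf = h, []
            traffic += ckpt.size
    return np.array(outs), traffic

rng = np.random.default_rng(0)
d_k, d_v, T = 64, 64, 32
h0 = np.zeros((d_k, d_v))
inputs = [(rng.uniform(0.9, 1.0), rng.standard_normal(d_k), rng.standard_normal(d_v))
          for _ in range(T)]
queries = rng.standard_normal((T, d_k))

out_std, traffic_std = decode_standard(h0, inputs, queries)
out_rep, traffic_rep = decode_replay(h0, inputs, queries, window=4)
assert np.allclose(out_std, out_rep)     # same outputs
print(f"replay / standard traffic: {traffic_rep / traffic_std:.2f}")
```

In this toy the saving comes from skipping the per-step state write. The exact ratio depends on the window and the state shape, so treat the printed number as illustrative only.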
Work with @tri_dao
Blog + Code👇