
Ze-Wei (Johnny) Liou and Tri Dao release ReplaySSM, halving SSM state memory traffic in hybrid models for roughly 2x faster speculative decoding

The method also delivers up to 1.43x faster standard decoding on large hybrid models.


Original post

Tri Dao@tri_dao

As hybrid models (Qwen 3.5 / Nemotron Ultra) run agents with massive context, Gated-DeltaNet / Mamba states become a bottleneck. A simple insight to make this 2x faster: load the states, compute, but don't store them. This recompute trick finally unlocks spec decoding for SSMs

Ze-Wei (Johnny) Liou@zwljohnny

Why do we store the SSM state at all? More and more models are hybrids (Nemotron-3, Qwen3.5), so SSM decode speed matters.

We only write it back every step so the next step can read it. ReplaySSM caches the recent inputs instead and rebuilds the state on the fly.

Same outputs, half the memory traffic → ~2x on spec decode at large batch sizes, which barely even helped SSMs before → up to 1.43x standard decode on large hybrids (up to Nemotron-Ultra-550B)

Work with @tri_dao

Blog + Code👇

6:49 AM · Jun 15, 2026 · 31K Views

Sentiment

Users praised ReplaySSM for rebuilding SSM states from cached inputs to halve memory traffic, calling the work nice and really cool.

Positive: 100.0% · Negative: 0.0% (2 comments with sentiment)

Cluster Engagement

Posts from X


Ze-Wei (Johnny) Liou@zwljohnny

[1/N] Blog post: https://dao-lab.ai/blog/2026/replayssm/ and https://tridao.me/blog/2026/replayssm/

Code: https://github.com/Johnny-Liou/ReplaySSM


Vipul Sharma@VipulS_1

@zwljohnny ? Snakes and Ladders: Accelerating State Space Model Inference with Speculative Decoding

https://assets.amazon.science/45/72/4848937d41b0b152ab24f1ca7d41/snakes-and-ladders-accelerating-state-space-model-inference-with-speculative-decoding.pdf


Ze-Wei (Johnny) Liou@zwljohnny

[2/N] In hybrid models, the SSM layers dominate latency up to 100K tokens (not attention). The SSM layer is deeply memory-bound. Most of its latency comes from memory traffic, loading and storing the state every step, not actual computation.
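
The memory-bound claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is a rough estimate, not a measurement: the head dimensions, byte widths, and buffer capacity are assumed values, and `state_traffic_per_step` is a hypothetical helper that compares storing the state every step with the cached-input scheme described in the thread.

```python
def state_traffic_per_step(d_k, d_v, state_bytes=4, input_bytes=2, capacity=16):
    """Rough per-head, per-token state traffic in bytes (all sizes are assumptions)."""
    state = d_k * d_v * state_bytes      # one recurrent state matrix
    token = (d_k + d_v) * input_bytes    # one cached (k_t, v_t) pair

    # Baseline: load the state, update it, store it back every step.
    baseline = 2 * state

    # Replay: load the state, read the buffered inputs, append one new pair,
    # and pay for a full state write only once every `capacity` steps.
    avg_buffered = (capacity - 1) / 2    # buffer holds 0..capacity-1 entries before the append
    replay = state + avg_buffered * token + token + state / capacity

    # An undecayed step does about 4 * d_k * d_v FLOPs:
    # a rank-1 update plus a matrix-vector readout.
    flops = 4 * d_k * d_v
    return {
        "baseline_bytes": baseline,
        "replay_bytes": replay,
        "baseline_flops_per_byte": flops / baseline,
    }


traffic = state_traffic_per_step(d_k=128, d_v=128)
speedup_bound = traffic["baseline_bytes"] / traffic["replay_bytes"]
```

With these assumed sizes the baseline performs half a FLOP per byte of state traffic, far below what modern accelerators need to be compute-bound, and the replay variant moves a little more than half as many bytes.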


Ze-Wei (Johnny) Liou@zwljohnny

[3/N] Speculative decoding doesn’t work well on SSMs, and the reason is rollback.

When a draft token gets rejected, you have to undo it. In a Transformer, you trivially move a pointer back in the KV cache. In an SSM, the state update is irreversible (the raw inputs are unrecoverable from the state), so vLLM has to keep a separate state for every single draft token just to be able to roll back.
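
The contrast between the two rollback mechanisms can be made concrete with a toy example. This is a sketch under assumptions, not vLLM's actual code: it uses a simplified recurrence S_t = a_t * S_{t-1} + v_t k_t^T, and both function names are hypothetical.

```python
import numpy as np


def rollback_kv_cache(kv_len, num_rejected):
    """Attention: rejected draft tokens are dropped by moving the length pointer."""
    return kv_len - num_rejected


def rollback_ssm_snapshots(S, a, K, V, num_accepted):
    """Toy SSM baseline: materialize one full state per draft token, keep the accepted one.

    S: (d_v, d_k) state before the drafts; a: (T,) decays; K: (T, d_k); V: (T, d_v).
    """
    snapshots = [S]  # snapshots[i] is the state after i draft tokens
    for a_t, k_t, v_t in zip(a, K, V):
        snapshots.append(a_t * snapshots[-1] + np.outer(v_t, k_t))
    extra_states = len(snapshots) - 1  # one d_v x d_k matrix per draft token
    return snapshots[num_accepted], extra_states
```

The pointer move costs nothing, while the snapshot approach stores as many extra state matrices as there are draft tokens.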


Ze-Wei (Johnny) Liou@zwljohnny

[6/N] ReplaySSM works for both standard and speculative decoding, and it generalizes to GDN (Gated DeltaNet). We built it on vLLM and tested at serving batch sizes on models from 4B to 550B parameters. Standard decoding gets up to 1.48x faster, and speculative decoding reaches 1.96x. The speculative decoding result is particularly important because vLLM's current SSM speculative decoding is actually slower than decoding token by token.


Ze-Wei (Johnny) Liou@zwljohnny

[4/N] ReplaySSM addresses both with a simple idea: instead of storing the state every step, it caches recent inputs in a small buffer and only updates the state when the buffer fills.

Same output, but state traffic roughly halves. Each step still loads the state once, but instead of writing the whole state back it just appends the new inputs to the buffer. Rollback also comes for free, because we explicitly cache the recent inputs and rollback becomes a pointer move.
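
A minimal sketch of the buffering idea, assuming a simplified single-head recurrence S_t = a_t * S_{t-1} + v_t k_t^T with output y_t = S_t q_t. The `ReplayState` class, its method names, and the buffer capacity are illustrative assumptions rather than the authors' kernel; here the committed state is just an array that is only overwritten on a flush.

```python
import numpy as np


class ReplayState:
    """Toy replay scheme: cache recent inputs, write the state only on a flush."""

    def __init__(self, d_k, d_v, capacity):
        self.S = np.zeros((d_v, d_k))  # last committed state
        self.capacity = capacity
        self.buf = []                  # (a, k, v) inputs seen since the last flush

    def _replay(self):
        # Rebuild the current state from the committed state plus buffered inputs.
        S = self.S
        for a, k, v in self.buf:
            S = a * S + np.outer(v, k)
        return S

    def step(self, a, k, v, q):
        """Append one token's inputs and return its output y_t."""
        self.buf.append((a, k, v))
        S = self._replay()             # temporary; normally discarded after the readout
        y = S @ q
        if len(self.buf) == self.capacity:
            self.S, self.buf = S, []   # flush: the only full state write
        return y

    def rollback(self, n):
        """Drop the last n buffered tokens, e.g. rejected drafts."""
        if n > len(self.buf):
            raise ValueError("this toy version cannot roll back past the last flush")
        if n:
            del self.buf[-n:]
```

Outputs match a recurrence that stores the state every step, and rollback only truncates the buffer. A real implementation would also have to make sure a flush never commits tokens that may still be rejected, which this sketch does not handle.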


Ze-Wei (Johnny) Liou@zwljohnny

[5/N] ReplaySSM also changes what must be produced. Before, states and outputs were both needed. Now, most steps only need the outputs.

So we compute the output directly from the buffer and never materialize the state at all, which is what the figure below shows for a single decode step. Speculative decoding is where this really pays off, since the same output-only form removes the serial state dependence that made SSMs hard to parallelize and turns the whole draft verification into GEMMs.
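
To illustrate how an output-only form can turn verification into matrix multiplies, here is a hedged sketch for the simplified scalar-decay recurrence S_t = a_t * S_{t-1} + v_t k_t^T, y_t = S_t q_t. The function names are hypothetical, and the real kernels (especially for Gated DeltaNet, whose update rule is more involved) will differ.

```python
import numpy as np


def verify_sequential(S0, a, Q, K, V):
    """Reference: step the state token by token and read out each output."""
    S, ys = S0, []
    for a_t, q_t, k_t, v_t in zip(a, Q, K, V):
        S = a_t * S + np.outer(v_t, k_t)
        ys.append(S @ q_t)
    return np.stack(ys)


def verify_output_only(S0, a, Q, K, V):
    """All T outputs from the shared initial state, without materializing any S_t.

    S0: (d_v, d_k); a: (T,) decays; Q, K: (T, d_k); V: (T, d_v).
    """
    c = np.cumprod(a)                        # c[t] = a_0 * ... * a_t
    # D[t, i] = decay applied to token i by the time of token t (0 if i > t).
    # Dividing cumulative products is fine for a toy; real kernels work in log space.
    D = np.tril(c[:, None] / c[None, :])
    from_state = c[:, None] * (Q @ S0.T)     # decayed contribution of S0
    from_buffer = (D * (Q @ K.T)) @ V        # causal contribution of the cached inputs
    return from_state + from_buffer
```

The second function has no loop over tokens: the serial dependence is replaced by two matrix products and a causal decay mask, which is the same structure as the chunkwise parallel form used in training.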


Riccardo Grazzi@riccardograzzi

@zwljohnny Cool! Is this conceptually similar to the chunkwise parallel form but applied to decoding instead of training?


Aditya Tomar@adityastomar_

@zwljohnny for the flush route, I think you can directly compute the new state via a single GEMM by assembling the v_t and k_t vectors into two matrices and multiplying them. This computes the sum of rank-1 outer products in a single GEMM rather than recurrently. Are you already doing this?


Ze-Wei (Johnny) Liou@zwljohnny

Hi, previous rollback approaches still materialize and write at least one recurrent state back to HBM at every step. Because they commit that state each step, they still suffer from a sequential state dependency: the state for the last committed tokens has to be rebuilt before the current drafts can be verified.


Ze-Wei (Johnny) Liou@zwljohnny

@adityastomar_ Yes, in the flush route both the state and the output are needed, so we compute VK^T for the state and read out the output by multiplying with q_t.

In most steps, we calculate the output directly (we don't need the state).
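
The flush computation discussed in this exchange can be sketched as follows, again assuming the simplified recurrence S_t = a_t * S_{t-1} + v_t k_t^T. The function names and the way the decay weights are folded into V are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def flush_state_recurrent(S0, a, K, V):
    """Reference: apply the buffered rank-1 updates one at a time."""
    S = S0
    for a_t, k_t, v_t in zip(a, K, V):
        S = a_t * S + np.outer(v_t, k_t)
    return S


def flush_state_gemm(S0, a, K, V):
    """Same result with one matrix multiply over the whole buffer.

    S0: (d_v, d_k); a: (T,) decays; K: (T, d_k); V: (T, d_v).
    """
    c = np.cumprod(a)
    w = c[-1] / c                       # decay each buffered token still undergoes
    return c[-1] * S0 + (V * w[:, None]).T @ K
```

With all decays equal to 1 this reduces to S0 + V^T K, the sum of rank-1 outer products computed as a single GEMM.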


Ze-Wei (Johnny) Liou@zwljohnny

The key idea is that storing the updated state is the natural design choice, but an SSM actually has the flexibility to store either the recent inputs (k and v) or the states.

The output-only route is a benefit that follows from this idea. That part is more similar to chunkwise parallel training (outputs within a chunk are computed from a shared initial state).


chi@chimcis

@zwljohnny really cool work!


Liran Ringel@liranringel

@zwljohnny Nice! 👏🏻👏🏻


Patrick Baitman@taofanqq

@tri_dao Flashrecompute


Ferbin@Ferbin08

@tri_dao Latency-critical systems like trading bots live or die on this tradeoff. Do you have numbers on the actual end-to-end inference latency you're hitting at scale?


Ze-Wei (Johnny) Liou@zwljohnny

@VipulS_1 The lack of parallelism and the state materialization cost become more pronounced in GDN and at larger batch sizes, where the benefit of amortizing weight loads becomes smaller.


Alfred Wu@AlfredWu270520

@zwljohnny @VipulS_1 Thanks @VipulS_1 for mentioning our paper! @zwljohnny I am also wondering whether the replay scheme handles prefix caching in vLLM, since we might store a stale state for prefix caching if the state is not fully up to date.


engineer cat 🐈@MLCatttt

@tri_dao so it is trading a bit of recompute for skipping the state writeback to HBM, on the bet that for SSMs the memory traffic was the real cost. is that the right read
