What if diffusion models could think ahead instead of being greedy at every step?🤔 We introduce:
Learned Relay Representations for Forward-Thinking Discrete Diffusion Models
The method raised HumanEval coding accuracy to 42.1%, up from 38.4% for vanilla SFT.
The problem: Masked Diffusion Models (MDMs) generate by unmasking tokens step by step, but discard all internal computation between steps.
The only information carried forward is which tokens are still masked---but no representational information is preserved.
Relay is architecture-agnostic, leaves inference decoding unchanged, and composes with block diffusion + KV caching. It's a simple drop-in for existing MDMs!
Bonus: BPTT through two denoising steps barely affects peak memory: 20.1 GiB (Relay) vs. 21.2 GiB (vanilla SFT).
Fast-dLLM v2's SFT forward already runs at 2x batch, so Relay's two forwards roughly equal one, and the LM-head backward sets the peak in both.
Our idea: Pass learned continuous representations across denoising steps.
We call them Learned Relay Representations: trainable representations passed forward between denoising steps, like a baton in a relay race.
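To make the baton idea concrete, here is a toy sketch of a decoding loop that carries a relay from one denoising step to the next. Everything in it is an assumption for illustration: `toy_forward`, `decode`, the `MASK` id, and especially the made-up confidence rule are hypothetical and are not the released Relay code. It only shows the data flow, not our results.

```python
from typing import List, Optional, Tuple

MASK = -1  # hypothetical id for a still-masked position
Relay = List[float]


def toy_forward(tokens: List[int], relay: Optional[Relay]) -> Tuple[List[int], List[float], Relay]:
    """Hypothetical stand-in for one denoising forward pass.

    Returns a predicted token and a confidence per position, plus a new relay.
    Here the relay is one float per position; a real model would emit vectors.
    """
    n = len(tokens)
    prev = relay if relay is not None else [0.0] * n
    preds = [i % 7 for i in range(n)]  # dummy predictions
    # Made-up rule: confidence rises with whatever the relay carried over.
    conf = [min(1.0, 0.3 + 0.1 * (i % 3) + prev[i]) for i in range(n)]
    new_relay = [min(1.0, p + 0.25) for p in prev]
    return preds, conf, new_relay


def decode(n: int, threshold: float = 0.6, use_relay: bool = True) -> Tuple[List[int], int]:
    """Unmask step by step; return the final tokens and the number of steps."""
    tokens = [MASK] * n
    relay: Optional[Relay] = None
    steps = 0
    while MASK in tokens:
        # Vanilla MDM: nothing but the tokens survives between steps.
        preds, conf, new_relay = toy_forward(tokens, relay if use_relay else None)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Commit every masked position above the threshold, else the single best one.
        chosen = [i for i in masked if conf[i] >= threshold]
        if not chosen:
            chosen = [max(masked, key=lambda i: conf[i])]
        for i in chosen:
            tokens[i] = preds[i]
        relay = new_relay  # the baton, handed to the next step
        steps += 1
    return tokens, steps


tokens_relay, steps_relay = decode(6, use_relay=True)
tokens_plain, steps_plain = decode(6, use_relay=False)
```

In this toy the relay run finishes in fewer steps only because the confidence rule was written that way; the point is where the relay enters and leaves each step.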
Why does BPTT help? At the same threshold, Relay commits more cells per step while keeping the board legal: 74.8% fully legal boards vs. 70.7% without BPTT.
We reach the same level of accuracy but in fewer steps (using more aggressive unmasking).
We validate our method ("Relay") on Sudoku-Extreme and find that it hits the best accuracy-per-NFE on the Pareto frontier, beating standard masked diffusion, rollouts, and a no-BPTT relay.
How do we enable models to learn to plan ahead?
We use truncated backpropagation through time (BPTT), where gradients flow across steps, so a decision at step t gets credit for what happens at step t+k.
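A scalar toy can show what "gradients flow across steps" means. This is a sketch under loud assumptions: a single weight `w`, a two-step chain, and hand-written derivatives. All function names are hypothetical and this is not our training code. Step t writes a relay, step t+1 reads it, and the loss sits only on step t+1.

```python
def two_step_loss(w: float, x: float, target: float) -> float:
    relay = w * x      # step t: produce the relay
    out = w * relay    # step t+1: consume the relay
    return (out - target) ** 2


def grad_with_bptt(w: float, x: float, target: float) -> float:
    relay = w * x
    out = w * relay
    dout = 2.0 * (out - target)
    # d(out)/dw = relay (direct use of w at step t+1)
    #           + w * x (path through the relay, back into step t)
    return dout * (relay + w * x)


def grad_without_bptt(w: float, x: float, target: float) -> float:
    relay = w * x      # treated as a constant (stop-gradient on the relay)
    out = w * relay
    dout = 2.0 * (out - target)
    return dout * relay  # step t gets no credit for the loss at step t+1


w, x, target = 0.5, 2.0, 1.0
eps = 1e-6
# Central finite difference of the true two-step loss, as a sanity check.
numeric = (two_step_loss(w + eps, x, target) - two_step_loss(w - eps, x, target)) / (2 * eps)
g_bptt = grad_with_bptt(w, x, target)
g_no_bptt = grad_without_bptt(w, x, target)
```

Only the BPTT gradient matches the finite-difference check. The stop-gradient version drops the term that credits step t for what happens at step t+1.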
Do you find autoregressive language models like @AnthropicAI's @claudeai Opus too slow? Diffusion models are catching up fast! But denoising alone is not sufficient to realize the promise of fast text generation. We (and the models 😉) need to think ahead! Check out our preprint👇
On HumanEval, Relay beats vanilla SFT accuracy (42.1% vs. 38.4%) while using 32% fewer forward passes (88.3 vs. 130.7 NFE). Relay is more accurate AND faster.
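For anyone who wants to check the arithmetic, the HumanEval numbers work out as follows. The variable names are just for illustration; the four inputs are the figures quoted above.

```python
relay_nfe, sft_nfe = 88.3, 130.7   # average forward passes (NFE)
relay_acc, sft_acc = 42.1, 38.4    # HumanEval accuracy, in percent

nfe_reduction = (sft_nfe - relay_nfe) / sft_nfe  # fraction of forward passes saved
acc_gain = relay_acc - sft_acc                   # gain in percentage points
nfe_ratio = sft_nfe / relay_nfe                  # vanilla SFT passes per Relay pass

print(f"{nfe_reduction:.1%} fewer NFE, +{acc_gain:.1f} points, {nfe_ratio:.2f}x fewer passes")
```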
Does it scale? We adapt Fast-dLLM v2 (1.5B), a SoTA MDM with KV caching + block-parallel decoding, into a Relay model. Importantly, it's a drop-in on top of existing MDMs.
This is joint work with Ben Rozonoyer, Jacopo Minniti, @_dhruveshp, @neilbband, @bose_joey, and @andrewmccallum.
📄Paper: https://arxiv.org/abs/2605.22967 💻Code: http://github.com/jacopo-minniti/relay 📢Blog: https://iesl.cs.umass.edu/diffusion/blog/2026/relay/