What if diffusion models could think ahead instead of being greedy at every step? We introduce:
Learned Relay Representations for Forward-Thinking Discrete Diffusion Models
The problem: Masked Diffusion Models (MDMs) generate by unmasking tokens step by step, but discard all internal computation between steps.
The only information carried forward is which tokens are still masked; no representational information is preserved.
This is joint work with Ben Rozonoyer, Jacopo Minniti, @_dhruveshp, @neilbband, @bose_joey, and @andrewmccallum.
💻 Code: http://github.com/jacopo-minniti/relay 📄 Blog: https://www.iesl.cs.umass.edu/diffusion/blog/2026/relay/
Relay is architecture-agnostic, leaves inference decoding unchanged, and composes with block diffusion + KV caching. It's a simple drop-in for existing MDMs!
Bonus: BPTT through two denoising steps barely affects peak memory: 20.1 GiB (Relay) vs. 21.2 GiB (vanilla SFT).
Fast-dLLM v2's SFT forward already runs at 2x batch, so Relay's two forwards roughly equal one vanilla forward, and the LM-head backward sets the peak in both.
Our idea: Pass learned continuous representations across denoising steps.
We call them Learned Relay Representations: trainable representations passed forward between denoising steps, like a baton in a relay race.
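To make the data flow concrete, here is a minimal sketch of a decoding loop in which a continuous vector is handed from one denoising step to the next. Everything in it is an illustrative assumption rather than the released code: `toy_step` is a hand-written stand-in for the network, `MASK` is a made-up token id, and the relay update rule is arbitrary. The only point is that `relay` goes into each step and comes back out, alongside the tokens.

```python
from typing import List, Optional, Tuple

MASK = -1  # hypothetical mask-token id, used only in this sketch


def toy_step(
    tokens: List[int], relay: Optional[List[float]]
) -> Tuple[List[int], List[float], List[float]]:
    """Stand-in for one denoiser call: returns (predictions, confidences, new_relay)."""
    n = len(tokens)
    prev = relay if relay is not None else [0.0] * n
    preds = [i % 10 for i in range(n)]                    # pretend per-position prediction
    confs = [1.0 / (1 + i) + prev[i] for i in range(n)]   # confidence can use the old relay
    new_relay = [0.5 * prev[i] + 0.1 for i in range(n)]   # continuous state handed onward
    return preds, confs, new_relay


def relay_decode(length: int, per_step: int) -> Tuple[List[int], int, Optional[List[float]]]:
    """Unmask `per_step` positions per step while carrying a relay across steps."""
    tokens: List[int] = [MASK] * length
    relay: Optional[List[float]] = None  # no baton exists before the first step
    steps = 0
    while MASK in tokens:
        preds, confs, relay = toy_step(tokens, relay)  # baton goes in and comes out
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Commit the most confident masked positions this step.
        for i in sorted(masked, key=lambda j: -confs[j])[:per_step]:
            tokens[i] = preds[i]
        steps += 1
    return tokens, steps, relay
```

In a vanilla MDM the third return value of the step would simply be dropped, so each step would restart from the token sequence alone.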
Do you find autoregressive language models like @AnthropicAI's @claudeai Opus too slow? Diffusion models are catching up fast! But denoising alone is not sufficient to realize the promise of fast text generation. We (and the models 😉) need to think ahead! Check out our preprint 👇
Why does BPTT help? At the same threshold, Relay commits more cells per step while keeping the board legal: 74.8% fully-legal boards vs 70.7% without BPTT.
We reach the same level of accuracy but in fewer steps (using more aggressive unmasking).
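The thread does not spell out the unmasking rule, so the following is only a sketch of one common choice, assumed here for illustration: commit every masked position whose confidence clears a threshold. The function name and the fallback to the single most confident position are hypothetical. It shows why a lower threshold means more cells committed per step and therefore fewer steps.

```python
from typing import List


def commit_by_threshold(confs: List[float], masked: List[int], threshold: float) -> List[int]:
    """Return the masked positions to commit this step under a confidence threshold."""
    chosen = [i for i in masked if confs[i] >= threshold]
    if not chosen and masked:
        # Assumed fallback: always commit the most confident position so decoding progresses.
        chosen = [max(masked, key=lambda i: confs[i])]
    return chosen
```

With confidences `[0.95, 0.6, 0.8, 0.3]`, a threshold of 0.9 commits one position, while a more aggressive 0.7 commits two in the same step.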
We validate our method ("Relay") on Sudoku-Extreme and find that it hits the best accuracy-per-NFE on the Pareto frontier, beating standard masked diffusion, rollouts, and a no-BPTT relay.
How do we enable models to learn to plan ahead?
We use truncated backpropagation through time (BPTT), where gradients flow across steps, so a decision at step t gets credit for what happens at step t+k.
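A scalar toy example can illustrate the credit-assignment point. It is a sketch under stated assumptions, not the training code: step t produces a relay `w * x`, step t+1 consumes it as `v * relay`, and the loss is measured only at step t+1. The parameter names `w`, `v`, `x`, `y` are hypothetical. With gradients flowing through the relay, `w` receives a nonzero gradient from the later step; if the relay is treated as a constant, that gradient is zero.

```python
def two_step_loss(w: float, v: float, x: float, y: float) -> float:
    relay = w * x           # step t: produce the relay
    pred = v * relay        # step t+1: consume the relay
    return (pred - y) ** 2  # loss is measured only at step t+1


def grad_w_with_bptt(w: float, v: float, x: float, y: float) -> float:
    # Chain rule through the relay: dL/dw = dL/dpred * dpred/drelay * drelay/dw
    return 2.0 * (v * w * x - y) * v * x


def grad_w_without_bptt(w: float, v: float, x: float, y: float) -> float:
    # Relay treated as a constant (detached), so the step t+1 loss cannot reach w.
    return 0.0


def finite_difference(w: float, v: float, x: float, y: float, eps: float = 1e-6) -> float:
    # Numerical check of the analytic BPTT gradient via a central difference.
    return (two_step_loss(w + eps, v, x, y) - two_step_loss(w - eps, v, x, y)) / (2 * eps)
```

Here the window is two steps, matching the "BPTT through two denoising steps" setting mentioned in the thread; longer windows would extend the chain rule in the same way.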
On HumanEval, Relay beats vanilla SFT accuracy (42.1% vs. 38.4%) while using 32% fewer forward passes (88.3 vs. 130.7 NFE). Relay is more accurate AND faster.
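As a quick check of the arithmetic behind these HumanEval numbers, the snippet below recomputes the relative NFE reduction and the accuracy gap from the four figures quoted above. The variable names are ours.

```python
relay_nfe, sft_nfe = 88.3, 130.7   # average forward passes (NFE) quoted above
relay_acc, sft_acc = 42.1, 38.4    # HumanEval accuracy in percent, quoted above

nfe_reduction = 1.0 - relay_nfe / sft_nfe  # fraction of forward passes saved
acc_gain = relay_acc - sft_acc             # gain in percentage points

print(f"NFE reduction: {nfe_reduction:.1%}")
print(f"Accuracy gain: {acc_gain:.1f} points")
```

The reduction comes out at roughly 32%, consistent with the figure stated in the post.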
Does it scale? We adapt Fast-dLLM v2 (1.5B), a SoTA MDM with KV caching + block-parallel decoding, into a Relay model. Importantly, it's a drop-in on top of existing MDMs.