
New Relay Method Lets Discrete Diffusion Models Plan Ahead

Tim G. J. Rudner (@timrudner)

What if diffusion models could think ahead instead of being greedy at every step? We introduce:

Learned Relay Representations for Forward-Thinking Discrete Diffusion Models

10:57 AM · Jun 9, 2026 · 907 Views

The problem: Masked Diffusion Models (MDMs) generate by unmasking tokens step by step, but discard all internal computation between steps.

The only information carried forward is which tokens are still masked; no representational information is preserved.
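To make the problem concrete, here is a minimal sketch of a vanilla masked-diffusion decoding loop. Everything in it is an assumption for illustration: `toy_denoiser` is a hypothetical stand-in that returns random predictions, and the top-k commit rule is just one common choice, not necessarily the one used in the paper. The point is only that the token list is the sole thing crossing the step boundary.

```python
import random

MASK = None  # hypothetical marker for a still-masked position


def toy_denoiser(tokens):
    """Hypothetical stand-in for one MDM forward pass.

    Returns a (prediction, confidence) pair for every position. Whatever
    internal state is computed here is local and is discarded on return.
    """
    rng = random.Random(sum(t is MASK for t in tokens))
    return [(rng.randint(0, 9), rng.random()) for _ in tokens]


def vanilla_decode(length, per_step=2):
    tokens = [MASK] * length
    nfe = 0  # number of forward passes
    while any(t is MASK for t in tokens):
        preds = toy_denoiser(tokens)  # only `tokens` is carried across steps
        nfe += 1
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        # Commit the most confident masked positions, then start from scratch.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = preds[i][0]
    return tokens, nfe


tokens, nfe = vanilla_decode(length=8, per_step=2)
print(tokens, nfe)
```

With 8 positions and 2 commits per step, the loop runs 4 forward passes, each starting from the tokens alone.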

This is joint work with Ben Rozonoyer, Jacopo Minniti, @_dhruveshp, @neilbband, @bose_joey, and @andrewmccallum.

Code: http://github.com/jacopo-minniti/relay
Blog: https://www.iesl.cs.umass.edu/diffusion/blog/2026/relay/

Relay is architecture-agnostic, leaves inference decoding unchanged, and composes with block diffusion + KV caching. It's a simple drop-in for existing MDMs!

Bonus: BPTT through two denoising steps barely affects peak memory: 20.1 GiB (Relay) vs. 21.2 GiB (vanilla SFT).

Fast-dLLM v2's SFT forward already runs at 2x batch, so Relay's two forwards roughly equal one SFT forward, and the LM-head backward sets the peak in both.


Our idea: Pass learned continuous representations across denoising steps.

We call them Learned Relay Representations: trainable representations passed forward between denoising steps like a baton in a relay race.
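As a rough illustration of the idea (not the authors' architecture), the decoding loop can carry a second piece of state next to the tokens. In this sketch, `toy_relay_denoiser`, the per-position float relay, and the mixing rule are all hypothetical; a real model would emit learned hidden vectors.

```python
MASK = None  # hypothetical marker for a still-masked position


def toy_relay_denoiser(tokens, relay):
    """Hypothetical forward pass that reads the old relay and writes a new one."""
    preds, new_relay = [], []
    for i, (tok, r) in enumerate(zip(tokens, relay)):
        # Toy "hidden state": mixes the incoming relay with the position's input.
        h = 0.5 * r + (1.0 if tok is MASK else 0.0) + 0.1 * i
        new_relay.append(h)
        preds.append((int(h * 10) % 10, 1.0 / (1.0 + h)))  # (token, confidence)
    return preds, new_relay


def relay_decode(length, per_step=2):
    tokens = [MASK] * length
    relay = [0.0] * length  # the "baton", empty before the first step
    history = []
    while any(t is MASK for t in tokens):
        # Both the tokens and the relay cross the step boundary.
        preds, relay = toy_relay_denoiser(tokens, relay)
        history.append(list(relay))
        masked = sorted(
            (i for i, t in enumerate(tokens) if t is MASK),
            key=lambda i: preds[i][1],
            reverse=True,
        )
        for i in masked[:per_step]:
            tokens[i] = preds[i][0]
    return tokens, history


tokens, history = relay_decode(length=6, per_step=2)
print(tokens)
print(len(history), "relay states recorded")
```

The decoding rule itself is untouched here; only the denoiser's inputs and outputs gain an extra channel.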

Do you find autoregressive language models like @AnthropicAI's @claudeai Opus too slow? Diffusion models are catching up fast! But denoising alone is not sufficient to realize the promise of fast text generation. We (and the models) need to think ahead! Check out our preprint.

Why does BPTT help? At the same threshold, Relay commits more cells per step while keeping the board legal: 74.8% fully-legal boards vs 70.7% without BPTT.

We reach the same level of accuracy but in fewer steps (using more aggressive unmasking).
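For readers unfamiliar with threshold-based unmasking, here is a minimal sketch of what "committing more cells at the same threshold" could look like. The function name, the confidence values, and the fallback rule (commit the single best position if nothing clears the threshold) are assumptions of this sketch, not details taken from the paper.

```python
def commit_by_threshold(confidences, masked, threshold):
    """Return the masked positions to unmask in this step.

    Commits every masked position whose confidence is at least `threshold`.
    If none qualifies, commits the single most confident one so decoding
    always makes progress (an assumption of this sketch).
    """
    chosen = [i for i in masked if confidences[i] >= threshold]
    if not chosen and masked:
        chosen = [max(masked, key=lambda i: confidences[i])]
    return chosen


masked = [0, 1, 2, 3]
less_confident = [0.95, 0.60, 0.92, 0.40]  # hypothetical confidences
more_confident = [0.95, 0.91, 0.92, 0.93]  # hypothetical confidences

print(commit_by_threshold(less_confident, masked, 0.9))  # [0, 2]
print(commit_by_threshold(more_confident, masked, 0.9))  # [0, 1, 2, 3]
```

A model that is confidently right about more cells clears the same threshold more often, so it finishes in fewer forward passes.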

We validate our method ("Relay") on Sudoku-Extreme and find that it hits the best accuracy per NFE (number of forward passes) on the Pareto frontier, beating standard masked diffusion, rollouts, and a no-BPTT relay.


How do we enable models to learn to plan ahead?

We use truncated backpropagation through time (BPTT), where gradients flow across steps, so a decision at step t gets credit for what happens at step t+k.
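A scalar toy example shows what "getting credit across steps" means. It assumes a shared weight `w`, a relay weight `u`, and a squared-error loss on the second step only; these names and the linear form are hypothetical and chosen so the gradient can be written by hand. With `through_relay=False` the relay is treated as a constant, which corresponds to a no-BPTT baseline.

```python
def two_step_loss(w, u, x1, x2, target):
    h = w * x1          # relay emitted at step t
    y = w * x2 + u * h  # step t+1 reads the relay
    return (y - target) ** 2


def grad_w(w, u, x1, x2, target, through_relay):
    """Hand-derived dL/dw, with or without the path through the relay."""
    h = w * x1
    y = w * x2 + u * h
    dL_dy = 2.0 * (y - target)
    direct = x2                                   # step t+1's own use of w
    via_relay = u * x1 if through_relay else 0.0  # credit routed back to step t
    return dL_dy * (direct + via_relay)


w, u, x1, x2, target = 0.5, 2.0, 1.0, 3.0, 1.0
full = grad_w(w, u, x1, x2, target, through_relay=True)
detached = grad_w(w, u, x1, x2, target, through_relay=False)

# Central finite difference as an independent check of the full gradient.
eps = 1e-5
fd = (two_step_loss(w + eps, u, x1, x2, target)
      - two_step_loss(w - eps, u, x1, x2, target)) / (2 * eps)

print(full, detached, fd)
```

Here the full gradient is 15.0 and the detached one is 9.0: the difference is exactly the signal that tells step t how its relay affected the outcome at step t+1.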

On HumanEval, Relay beats vanilla SFT accuracy (42.1% vs. 38.4%) while using 32% fewer forward passes (88.3 vs. 130.7 NFE). Relay is more accurate AND faster.
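The 32% figure follows directly from the two NFE numbers quoted above; a quick check, using only those reported values:

```python
relay_nfe, sft_nfe = 88.3, 130.7
relay_acc, sft_acc = 42.1, 38.4

nfe_reduction = 1 - relay_nfe / sft_nfe
acc_gain_points = round(relay_acc - sft_acc, 1)

print(f"{nfe_reduction:.1%} fewer forward passes")  # 32.4% fewer forward passes
print(f"+{acc_gain_points} accuracy points")        # +3.7 accuracy points
```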


Does it scale? We adapt Fast-dLLM v2 (1.5B), a SoTA MDM with KV caching + block-parallel decoding, into a Relay model. Importantly, it's a drop-in on top of existing MDMs.
