New Relay Method Lets Discrete Diffusion Models Plan Ahead

Original post

Tim G. J. Rudner@timrudner

What if diffusion models could think ahead instead of being greedy at every step?🤔 We introduce:

Learned Relay Representations for Forward-Thinking Discrete Diffusion Models

10:57 AM · Jun 9, 2026 · 907 Views

Posts from X

Tim G. J. Rudner@timrudner

The problem: Masked Diffusion Models (MDMs) generate by unmasking tokens step by step, but discard all internal computation between steps.

The only information carried forward is which tokens are still masked; no representational information is preserved.
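To make the bottleneck concrete, here is a minimal sketch of a confidence-threshold unmasking loop. Everything in it is a hypothetical stand-in (`toy_denoiser`, `decode`, the threshold value, the random predictions); it is not the authors' model or code. The point is only that `tokens` is the sole state that survives from one step to the next.

```python
import random

MASK = None  # placeholder marking a still-masked position


def toy_denoiser(tokens, rng):
    """Hypothetical stand-in for one MDM forward pass.

    Returns a (token, confidence) guess for every position.
    """
    return [(rng.randrange(10), rng.random()) for _ in tokens]


def decode(length=8, threshold=0.7, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * length
    nfe = 0  # number of forward passes (function evaluations)
    while MASK in tokens:
        preds = toy_denoiser(tokens, rng)
        nfe += 1
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        # Commit every masked position whose confidence clears the threshold.
        commit = [i for i in masked if preds[i][1] >= threshold]
        if not commit:
            # Guarantee progress: commit the single most confident position.
            commit = [max(masked, key=lambda i: preds[i][1])]
        for i in commit:
            tokens[i] = preds[i][0]
        # `preds`, and any hidden states a real model computed to produce
        # them, go out of scope here; only `tokens` reaches the next step.
    return tokens, nfe


tokens, nfe = decode()
print(tokens, nfe)
```

Lowering `threshold` in this toy commits more positions per step and so reduces `nfe`, which is the accuracy-versus-steps trade-off the thread refers to.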

Tim G. J. Rudner@timrudner

This is joint work with Ben Rozonoyer, Jacopo Minniti, @_dhruveshp, @neilbband, @bose_joey, and @andrewmccallum.

💻 Code: http://github.com/jacopo-minniti/relay
📄 Blog: https://www.iesl.cs.umass.edu/diffusion/blog/2026/relay/

Tim G. J. Rudner@timrudner

Relay is architecture-agnostic, leaves inference decoding unchanged, and composes with block diffusion + KV caching. It's a simple drop-in for existing MDMs!

Tim G. J. Rudner@timrudner

Bonus: BPTT through two denoising steps barely affects peak memory: 20.1 GiB (Relay) vs. 21.2 GiB (vanilla SFT).

Fast-dLLM v2's SFT forward pass already runs at 2x batch, so Relay's two forward passes roughly equal one of them, and the LM-head backward pass sets the peak in both cases.

Tim G. J. Rudner@timrudner

Our idea: Pass learned continuous representations across denoising steps.

We call them Learned Relay Representations: trainable representations passed forward between denoising steps, like a baton in a relay race.
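The data flow can be sketched as a step function that takes and returns a continuous vector alongside the tokens. This is an illustration under assumptions, not the authors' architecture: `StepState`, `toy_step`, and the decay-and-add update rule are hypothetical names and choices made up for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepState:
    tokens: List[Optional[int]]  # None marks a masked position
    relay: List[float]           # continuous vector handed to the next step


def toy_step(state: StepState) -> StepState:
    """Hypothetical denoising step: unmask one slot and update the relay."""
    tokens = list(state.tokens)
    if None in tokens:
        i = tokens.index(None)
        tokens[i] = i % 10  # placeholder "prediction"
    # Toy relay update: decay the incoming relay and add a feature of the
    # current state (here, the fraction of positions already filled).
    filled = sum(t is not None for t in tokens) / len(tokens)
    relay = [0.5 * r + filled for r in state.relay]
    return StepState(tokens, relay)


state = StepState(tokens=[None] * 4, relay=[0.0, 0.0])
for _ in range(4):
    state = toy_step(state)  # the relay is the baton passed between steps
print(state)
```

In a real model the relay would presumably be produced by the network itself and trained end to end; the fixed update rule here only shows where the extra state enters and leaves each step.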

Dhruvesh Patel @Neurips@_dhruveshp

Do you find autoregressive language models like @AnthropicAI's @claudeai Opus too slow? Diffusion models are catching up fast! But denoising alone is not sufficient to realize the promise of fast text generation. We (and the models 😉) need to think ahead! Check out our preprint👇

Tim G. J. Rudner@timrudner

Why does BPTT help? At the same threshold, Relay commits more cells per step while keeping the board legal: 74.8% fully-legal boards vs 70.7% without BPTT.

We reach the same level of accuracy but in fewer steps (using more aggressive unmasking).

Tim G. J. Rudner@timrudner

We validate our method ("Relay") on Sudoku-Extreme and find that it hits the best accuracy-per-NFE on the Pareto frontier, beating standard masked diffusion, rollouts, and a no-BPTT relay.

Tim G. J. Rudner@timrudner

How do we enable models to learn to plan ahead?

We use truncated backpropagation through time (BPTT), where gradients flow across steps, so a decision at step t gets credit for what happens at step t+k.
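A scalar toy makes the credit-assignment point concrete. Assume step t writes a relay value `a * x` and step t+1 reads it to predict `b * relay`, with a loss only at step t+1. The parameters `a` and `b`, the squared-error loss, and the two-step window are all assumptions made for this sketch, not the paper's setup.

```python
def forward(a, b, x):
    relay = a * x      # step t writes a relay value
    pred = b * relay   # step t+1 reads it to make its prediction
    return relay, pred


def loss(a, b, x, target):
    _, pred = forward(a, b, x)
    return (pred - target) ** 2


def grad_a_with_bptt(a, b, x, target):
    _, pred = forward(a, b, x)
    # Chain rule across the step boundary:
    # dL/dpred * dpred/drelay * drelay/da
    return 2.0 * (pred - target) * b * x


def grad_a_without_bptt(a, b, x, target):
    # With the relay treated as a constant (stop-gradient), step t's
    # parameter receives no signal from the loss at step t+1.
    return 0.0


a, b, x, target = 0.5, 2.0, 3.0, 1.0
eps = 1e-6
# Central finite difference as an independent check of the analytic gradient.
numeric = (loss(a + eps, b, x, target) - loss(a - eps, b, x, target)) / (2 * eps)
analytic = grad_a_with_bptt(a, b, x, target)
print(analytic, numeric, grad_a_without_bptt(a, b, x, target))
```

The analytic and finite-difference gradients agree, while the stop-gradient variant returns zero: without gradient flow across steps, the parameter that writes the relay is never told how useful the relay was later.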

Tim G. J. Rudner@timrudner

On HumanEval, Relay beats vanilla SFT accuracy (42.1% vs. 38.4%) while using 32% fewer forward passes (88.3 vs. 130.7 NFE). Relay is more accurate AND faster.
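The headline percentages follow directly from the four numbers quoted in the post; a quick check, with variable names chosen here for illustration:

```python
relay_acc, sft_acc = 42.1, 38.4    # HumanEval accuracy (%), as quoted
relay_nfe, sft_nfe = 88.3, 130.7   # forward passes (NFE), as quoted

acc_gain_points = relay_acc - sft_acc
nfe_reduction = (sft_nfe - relay_nfe) / sft_nfe

print(f"accuracy gain: {acc_gain_points:.1f} points")  # accuracy gain: 3.7 points
print(f"NFE reduction: {nfe_reduction:.1%}")           # NFE reduction: 32.4%
```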

Tim G. J. Rudner@timrudner

Does it scale? We adapt Fast-dLLM v2 (1.5B), a SoTA MDM with KV caching + block-parallel decoding, into a Relay model. Importantly, it's a drop-in on top of existing MDMs.
