Tech · 1d ago

Tim G. J. Rudner, University of Toronto assistant professor, releases RELAY to help Masked Diffusion Models propagate continuous context

The method raised HumanEval coding accuracy from 38.4% (vanilla SFT) to 42.1%.

Original post

Tim G. J. Rudner @timrudner

What if diffusion models could think ahead instead of being greedy at every step?🤔 We introduce:

Learned Relay Representations for Forward-Thinking Discrete Diffusion Models

10:57 AM · Jun 9, 2026 · 2.4K Views

Posts from X

Tim G. J. Rudner @timrudner

The problem: Masked Diffusion Models (MDMs) generate by unmasking tokens step by step, but discard all internal computation between steps.

The only information carried forward is which tokens are still masked; no representational information is preserved.

Tim G. J. Rudner @timrudner

Relay is architecture-agnostic, leaves inference decoding unchanged, and composes with block diffusion + KV caching. It's a simple drop-in for existing MDMs!

Tim G. J. Rudner @timrudner

Bonus: BPTT through two denoising steps barely affects peak memory: 20.1 GiB (Relay) vs. 21.2 GiB (vanilla SFT).

Fast-dLLM v2's SFT forward already runs at 2x batch, so Relay's two forwards roughly equal one, and the LM-head backward sets the peak in both.

Tim G. J. Rudner @timrudner

Our idea: Pass learned continuous representations across denoising steps.

We call them Learned Relay Representations: trainable representations passed forward between denoising steps like a baton in a relay race.
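
To make the contrast concrete, here is a minimal sketch of what is carried between denoising steps with and without a relay. It assumes a toy stand-in for the denoiser; `toy_denoiser`, `decode`, the 2-dimensional relay vector, and the left-to-right unmasking order are all hypothetical choices for illustration, not the authors' code or API.

```python
from typing import List, Optional, Tuple

MASK = -1  # sentinel for a position that is still masked


def toy_denoiser(
    tokens: List[int], relay: Optional[List[float]]
) -> Tuple[List[int], List[float]]:
    """Hypothetical denoiser: proposes a token for every position and
    emits a new relay vector computed from the previous one."""
    n_masked = sum(t == MASK for t in tokens)
    prev = relay if relay is not None else [0.0, 0.0]
    # Toy relay contents: a step counter and the masked count just seen.
    new_relay = [prev[0] + 1.0, float(n_masked)]
    proposals = [i % 10 for i in range(len(tokens))]
    return proposals, new_relay


def decode(length: int, per_step: int, use_relay: bool):
    """Unmask `per_step` positions per step; record what is carried forward."""
    tokens = [MASK] * length
    relay: Optional[List[float]] = None
    trace = []
    while MASK in tokens:
        proposals, new_relay = toy_denoiser(tokens, relay if use_relay else None)
        masked = [i for i, t in enumerate(tokens) if t == MASK][:per_step]
        for i in masked:
            tokens[i] = proposals[i]
        # Without a relay, only `tokens` survives to the next step.
        relay = new_relay if use_relay else None
        trace.append(relay)
    return tokens, trace


plain_tokens, plain_trace = decode(6, 2, use_relay=False)
relay_tokens, relay_trace = decode(6, 2, use_relay=True)
print(plain_trace)  # [None, None, None]
print(relay_trace)  # [[1.0, 6.0], [2.0, 4.0], [3.0, 2.0]]
```

In the plain loop each step starts from the token sequence alone; in the relay loop each step also receives a continuous vector produced by the previous step. A real model would learn that vector rather than compute it by a fixed rule.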

Tim G. J. Rudner @timrudner

Why does BPTT help? At the same threshold, Relay commits more cells per step while keeping the board legal: 74.8% fully-legal boards vs 70.7% without BPTT.

We reach the same level of accuracy but in fewer steps (using more aggressive unmasking).
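
As a rough illustration of the trade-off described here, the sketch below commits every proposed cell whose confidence clears a threshold and then checks whether the partially filled board is still legal. It assumes "threshold" means a per-cell confidence cutoff and "legal" means no repeated digit in any row, column, or 3x3 box; `commit`, `is_legal`, and the proposal tuples are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

Board = List[List[int]]  # 9x9 grid, 0 marks an empty (masked) cell
Proposal = Tuple[int, int, int, float]  # (row, col, digit, confidence)


def is_legal(board: Board) -> bool:
    """True if no filled digit repeats in any row, column, or 3x3 box."""
    units = [[(r, c) for c in range(9)] for r in range(9)]
    units += [[(r, c) for r in range(9)] for c in range(9)]
    units += [
        [(br + dr, bc + dc) for dr in range(3) for dc in range(3)]
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    for unit in units:
        digits = [board[r][c] for r, c in unit if board[r][c] != 0]
        if len(digits) != len(set(digits)):
            return False
    return True


def commit(board: Board, proposals: List[Proposal], threshold: float):
    """Fill empty cells whose confidence is at least `threshold`.
    Returns the new board and the number of cells committed."""
    new_board = [row[:] for row in board]
    committed = 0
    for r, c, digit, conf in proposals:
        if new_board[r][c] == 0 and conf >= threshold:
            new_board[r][c] = digit
            committed += 1
    return new_board, committed


empty = [[0] * 9 for _ in range(9)]
# Made-up proposals: two confident cells that clash, one low-confidence cell.
clashing = [(0, 0, 5, 0.95), (0, 1, 5, 0.92), (4, 4, 7, 0.60)]
board, n = commit(empty, clashing, threshold=0.9)
print(n, is_legal(board))  # 2 False
```

Committing more cells per step only pays off if the joint commitments stay consistent, which is the property the legal-board percentages in the post measure.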

Tim G. J. Rudner @timrudner

We validate our method ("Relay") on Sudoku-Extreme and find that it hits the best accuracy-per-NFE on the Pareto frontier, beating standard masked diffusion, rollouts, and a no-BPTT relay.
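
For readers unfamiliar with the term, this is a small sketch of how an accuracy-versus-NFE Pareto frontier can be computed. The function name `pareto_frontier` and the four points are made up for illustration; they are not results from the paper.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (label, NFE, accuracy)


def pareto_frontier(points: List[Point]) -> List[str]:
    """Return labels of points not dominated by any other point.
    A point is dominated if another has NFE <= and accuracy >=,
    with at least one of the two strictly better."""
    frontier = []
    for name, nfe, acc in points:
        dominated = any(
            (n2 <= nfe and a2 >= acc) and (n2 < nfe or a2 > acc)
            for _, n2, a2 in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical configurations, not measured values.
points = [("A", 10, 0.60), ("B", 20, 0.80), ("C", 20, 0.70), ("D", 40, 0.75)]
print(pareto_frontier(points))  # ['A', 'B']
```

A method "on the frontier" is one that no other configuration beats on both cost (NFE, the number of function evaluations) and accuracy at once.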

Tim G. J. Rudner @timrudner

How do we enable models to learn to plan ahead?

We use truncated backpropagation through time (BPTT), where gradients flow across steps, so a decision at step t gets credit for what happens at step t+k.
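
The credit-assignment point can be seen in a two-step scalar toy. The sketch below assumes a hypothetical recurrence `h_next = tanh(w * h + x)` standing in for a denoising step with relay `h`, and a squared-error loss on the second step only. It compares the gradient with respect to `w` when it flows back through the first step against the gradient when the relay is treated as a constant. None of these names or formulas come from the paper.

```python
import math


def two_step_loss(w: float, h0: float, x0: float, x1: float, y: float) -> float:
    """Loss measured only after the second step."""
    h1 = math.tanh(w * h0 + x0)  # step t produces relay h1
    h2 = math.tanh(w * h1 + x1)  # step t+1 consumes relay h1
    return (h2 - y) ** 2


def grad_w(w, h0, x0, x1, y, through_time: bool) -> float:
    """Analytic dLoss/dw. With through_time=False the relay h1 is treated
    as a constant, so step t gets no credit for the loss at step t+1."""
    h1 = math.tanh(w * h0 + x0)
    h2 = math.tanh(w * h1 + x1)
    dloss_dh2 = 2.0 * (h2 - y)
    dh2_dpre = 1.0 - h2 ** 2
    direct = h1  # how w acts inside step t+1
    dh1_dw = (1.0 - h1 ** 2) * h0 if through_time else 0.0
    return dloss_dh2 * dh2_dpre * (direct + w * dh1_dw)


args = dict(w=0.8, h0=0.5, x0=0.3, x1=-0.2, y=1.0)
full = grad_w(**args, through_time=True)
detached = grad_w(**args, through_time=False)

# Central finite difference as a sanity check on the two-step gradient.
eps = 1e-6
hi = two_step_loss(args["w"] + eps, args["h0"], args["x0"], args["x1"], args["y"])
lo = two_step_loss(args["w"] - eps, args["h0"], args["x0"], args["x1"], args["y"])
numeric = (hi - lo) / (2 * eps)
print(abs(full - numeric) < 1e-6, full != detached)  # True True
```

Only the through-time gradient matches the true derivative of the two-step loss; the detached version ignores how the first step's output shaped the second step's outcome.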

Dhruvesh Patel @_dhruveshp

Do you find autoregressive language models like @AnthropicAI's @claudeai Opus too slow? Diffusion models are catching up fast! But denoising alone is not sufficient to realize the promise of fast text generation. We (and the models 😉) need to think ahead! Check out our preprint👇

Tim G. J. Rudner @timrudner

On HumanEval, Relay beats vanilla SFT accuracy (42.1% vs. 38.4%) while using 32% fewer forward passes (88.3 vs. 130.7 NFE). RELAY is more accurate AND faster.
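
The quoted figures can be checked with a few lines of arithmetic. The helper name `relative_reduction` is ours; the four input numbers are the ones stated in the post.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is smaller than `baseline`."""
    return (baseline - new) / baseline


nfe_saving = relative_reduction(130.7, 88.3)  # forward passes (NFE)
acc_gain = round(42.1 - 38.4, 1)              # percentage points
speed_ratio = round(130.7 / 88.3, 2)          # vanilla NFE per Relay NFE

print(f"{nfe_saving:.1%}")  # 32.4%
print(acc_gain)             # 3.7
print(speed_ratio)          # 1.48
```

So the "32% fewer forward passes" claim corresponds to roughly 1.48 vanilla forward passes per Relay forward pass, alongside a 3.7-point accuracy gain.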

Tim G. J. Rudner @timrudner

Does it scale? We adapt Fast-dLLM v2 (1.5B), a SoTA MDM with KV caching + block-parallel decoding, into a Relay model. Importantly, it's a drop-in on top of existing MDMs.

Tim G. J. Rudner @timrudner

This is joint work with Ben Rozonoyer, Jacopo Minniti, @_dhruveshp, @neilbband, @bose_joey, and @andrewmccallum.

📄Paper: https://arxiv.org/abs/2605.22967
💻Code: http://github.com/jacopo-minniti/relay
📢Blog: https://iesl.cs.umass.edu/diffusion/blog/2026/relay/
