Tim G. J. Rudner, University of Toronto assistant professor, releases RELAY to help Masked Diffusion Models propagate continuous context

The method raised HumanEval coding accuracy to 42.1%.

Original post by Tim G. J. Rudner (@timrudner)

What if diffusion models could think ahead instead of being greedy at every step?🤔 We introduce:

Learned Relay Representations for Forward-Thinking Discrete Diffusion Models

10:57 AM · Jun 9, 2026 · 2.4K Views

The problem: Masked Diffusion Models (MDMs) generate by unmasking tokens step by step, but discard all internal computation between steps.

The only information carried forward is which tokens are still masked; no representational information is preserved.
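
To make the problem concrete, here is a minimal sketch of a vanilla masked-diffusion decoding loop. It is not the authors' code: `toy_denoiser`, `vanilla_decode`, the `per_step` parameter, and the random confidences are all hypothetical stand-ins. The point is only that the token list is the sole state that crosses step boundaries.

```python
import random

MASK = "_"

def toy_denoiser(tokens, target):
    """Hypothetical stand-in for one MDM forward pass.

    Returns {position: (predicted_token, confidence)} for masked positions.
    Any intermediate computation is local to this call and is gone on return.
    """
    rng = random.Random(tokens.count(MASK))  # deterministic toy confidences
    return {i: (target[i], rng.random())
            for i, tok in enumerate(tokens) if tok == MASK}

def vanilla_decode(target, per_step=2):
    """Unmask `per_step` tokens per step; return (text, number of steps)."""
    tokens = [MASK] * len(target)
    steps = 0
    while MASK in tokens:
        proposals = toy_denoiser(tokens, target)
        most_confident = sorted(proposals, key=lambda i: proposals[i][1],
                                reverse=True)[:per_step]
        for i in most_confident:
            tokens[i] = proposals[i][0]
        steps += 1
        # Only `tokens` (what is decided, what is still masked) reaches the
        # next iteration; nothing else from the forward pass is kept.
    return "".join(tokens), steps

print(vanilla_decode("relay"))  # ('relay', 3)
```

Every call to `toy_denoiser` starts from scratch, which is the greedy behavior the thread describes.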

Our idea: Pass learned continuous representations across denoising steps.

We call them Learned Relay Representations: trainable representations passed forward between denoising steps like a baton in a relay race.
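
A rough sketch of the baton idea, under loud assumptions: the relay here is a plain list of floats, the blend rule and the neighbor-based scoring are invented for illustration, and nothing is trained. `toy_relay_denoiser` and `relay_decode` are hypothetical names, not functions from the Relay codebase. What the sketch shows is the shape of the loop: each step receives a continuous vector from the previous step and hands an updated one to the next.

```python
MASK = "_"

def toy_relay_denoiser(tokens, relay, target):
    """Hypothetical forward pass that reads the incoming relay and emits a new one."""
    # Stand-in for hidden states: 1.0 where a token is already decided.
    hidden = [0.0 if tok == MASK else 1.0 for tok in tokens]
    # Toy update rule (an assumption): blend the old relay with this step's state.
    new_relay = [0.5 * r + 0.5 * h for r, h in zip(relay, hidden)]
    # Toy use of the relay: masked cells next to "warm" neighbors score higher.
    proposals = {}
    for i, tok in enumerate(tokens):
        if tok == MASK:
            left = new_relay[i - 1] if i > 0 else 0.0
            right = new_relay[i + 1] if i + 1 < len(tokens) else 0.0
            proposals[i] = (target[i], left + right)
    return proposals, new_relay

def relay_decode(target, per_step=2):
    """Decode while carrying a continuous relay vector across steps."""
    tokens = [MASK] * len(target)
    relay = [0.0] * len(target)  # toy initial relay
    history = []
    while MASK in tokens:
        proposals, relay = toy_relay_denoiser(tokens, relay, target)
        # Highest score first, ties broken by position.
        order = sorted(proposals, key=lambda i: (-proposals[i][1], i))
        for i in order[:per_step]:
            tokens[i] = proposals[i][0]
        history.append(relay)  # the baton handed to the next step
    return "".join(tokens), history

text, history = relay_decode("abcd")
print(text, history[-1])  # abcd [0.5, 0.5, 0.0, 0.0]
```

Compared with a vanilla loop, the only structural change is the extra `relay` argument and return value.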

How do we enable models to learn to plan ahead?

We use truncated backpropagation through time (BPTT), where gradients flow across steps, so a decision at step t gets credit for what happens at step t+k.
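
The credit-assignment claim can be illustrated with a deliberately tiny scalar example. Everything here is an assumption made for illustration: one shared weight `w`, a relay computed as `w * x` at step t, and a prediction `w * relay` at step t+1 with a squared-error loss. The real method operates on network representations, not scalars. The sketch compares the gradient when it flows back through the relay (BPTT over two steps) with the gradient when the relay is treated as a constant.

```python
def two_step_loss(w, x, target):
    """Loss measured at step t+1, after the relay from step t is consumed."""
    relay = w * x        # step t writes the relay
    pred = w * relay     # step t+1 reads it
    return (pred - target) ** 2

def grad_with_bptt(w, x, target):
    """dLoss/dw with gradients flowing across both steps."""
    relay = w * x
    pred = w * relay
    dloss_dpred = 2.0 * (pred - target)
    direct = relay             # d pred / d w with the relay held fixed (step t+1)
    through_relay = w * x      # (d pred / d relay) * (d relay / d w): credit to step t
    return dloss_dpred * (direct + through_relay)

def grad_without_bptt(w, x, target):
    """dLoss/dw when the relay is detached, so step t receives no credit."""
    relay = w * x              # treated as a constant
    pred = w * relay
    return 2.0 * (pred - target) * relay

def numeric_grad(w, x, target, eps=1e-6):
    """Central finite difference, used only to sanity-check the BPTT gradient."""
    return (two_step_loss(w + eps, x, target)
            - two_step_loss(w - eps, x, target)) / (2 * eps)

print(grad_with_bptt(0.5, 2.0, 1.0))     # -2.0
print(grad_without_bptt(0.5, 2.0, 1.0))  # -1.0
```

In this toy setting the detached gradient misses half of the signal: the part that says how the choice made at step t affected the outcome at step t+1.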

We validate our method ("Relay") on Sudoku-Extreme and find that it achieves the best accuracy per number of function evaluations (NFE) on the Pareto frontier, beating standard masked diffusion, rollouts, and a no-BPTT relay.

Dhruvesh Patel (@_dhruveshp)

Do you find autoregressive language models like @AnthropicAI's @claudeai Opus too slow? Diffusion models are catching up fast! But denoising alone is not sufficient to realize the promise of fast text generation. We (and the models 😉) need to think ahead! Check out our preprint👇

Bonus: BPTT through two denoising steps barely affects peak memory: 20.1 GiB (Relay) vs. 21.2 GiB (vanilla SFT).

Fast-dLLM v2's SFT forward already runs at 2x batch, so Relay's two forwards roughly equal one, and the LM-head backward pass sets the peak in both.

Why does BPTT help? At the same threshold, Relay commits more cells per step while keeping the board legal: 74.8% fully legal boards vs. 70.7% without BPTT.

We reach the same level of accuracy but in fewer steps (using more aggressive unmasking).
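
For readers unfamiliar with threshold-based unmasking, here is a minimal sketch of the rule: commit every masked cell whose confidence clears a fixed threshold. The function name `commit_by_threshold`, the fallback to the single most confident cell, and the confidence values are all assumptions made for illustration; the numbers are invented and are not results from the paper.

```python
def commit_by_threshold(confidences, threshold):
    """Return the masked positions to commit in this step.

    `confidences` maps masked position -> model confidence in its top prediction.
    """
    above = sorted(i for i, c in confidences.items() if c >= threshold)
    if above:
        return above
    # Assumed fallback: commit the single most confident cell so that
    # decoding always makes progress.
    return [max(confidences, key=confidences.get)]

# Invented confidences for the same three cells under two hypothetical models.
hesitant = {0: 0.95, 1: 0.60, 2: 0.55}
decisive = {0: 0.95, 1: 0.93, 2: 0.91}

print(commit_by_threshold(hesitant, 0.9))  # [0]
print(commit_by_threshold(decisive, 0.9))  # [0, 1, 2]
```

With the threshold held fixed, a model that is confident about more cells fills them in one step instead of three, which is how more aggressive unmasking turns into fewer forward passes, provided the committed cells are still correct.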

On HumanEval, Relay beats vanilla SFT accuracy (42.1% vs. 38.4%) while using 32% fewer forward passes (88.3 vs. 130.7 NFE). RELAY is more accurate AND faster.
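
The quoted figures can be checked with a few lines of arithmetic. This snippet only recomputes numbers stated in the post; the function names are illustrative.

```python
def relative_reduction(new, old):
    """Fractional reduction of `new` relative to `old`."""
    return 1.0 - new / old

def point_gain(new_pct, old_pct):
    """Absolute difference in percentage points."""
    return new_pct - old_pct

nfe_saving = relative_reduction(88.3, 130.7)  # Relay vs. vanilla SFT forward passes
accuracy_gain = point_gain(42.1, 38.4)        # HumanEval accuracy, in points

print(f"{nfe_saving * 100:.1f}% fewer forward passes")  # 32.4% fewer forward passes
print(f"{accuracy_gain:.1f} points higher accuracy")    # 3.7 points higher accuracy
```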

Does it scale? We adapt Fast-dLLM v2 (1.5B), a SoTA MDM with KV caching + block-parallel decoding, into a Relay model. Importantly, it's a drop-in on top of existing MDMs.

This is joint work with Ben Rozonoyer, Jacopo Minniti, @_dhruveshp, @neilbband, @bose_joey, and @andrewmccallum.

📄Paper: https://arxiv.org/abs/2605.22967 💻Code: http://github.com/jacopo-minniti/relay 📢Blog: https://iesl.cs.umass.edu/diffusion/blog/2026/relay/

Relay is architecture-agnostic, leaves inference decoding unchanged, and composes with block diffusion + KV caching. It's a simple drop-in for existing MDMs!
