New Paper BlockGen Challenges Dominance of Uniform-State Diffusion Models

Original post

Caglar Gulcehre#816

Justin Deschenaux@jdeschena

🔥 New paper: BlockGen: Flexible Blockwise Sequence Modeling with Hybrid Samplers

Are uniform-state diffusion models (USDMs) always stronger than masked diffusion models (MDMs)? Recent work suggests so. However, a few questions remain open 🤔

w/ @caglarml

(1/11)

10:07 AM · Jun 6, 2026 · 1.5K Views

Justin Deschenaux@jdeschena

@caglarml 🔁 For example, unlike MDMs, USDMs can self-correct earlier mistakes. However, prior work has not compared USDMs against MDMs paired with strong remasking correctors, which let masked models revise mistakes too 🤔

(2/11)

Justin Deschenaux@jdeschena

@caglarml @mariannearr ❓ Q1: do USDMs still beat MDMs when generating block-by-block?

Under ancestral sampling, yes. Uniform wins, with the biggest gap at low NFE.

(6/11)

Justin Deschenaux@jdeschena

@caglarml @mariannearr 📈 Training over a mixture of block sizes

On OpenWebText, with uniform mixture weights, BlockGen reaches 17.5 PPL (masked), slightly above AR (16.7) and well below the best fixed-block model (21.6). Even with only 5% of steps in AR mode, it still reaches 19.1.

(4/11)

Justin Deschenaux@jdeschena

@caglarml @mariannearr 🔀 Using a strong predictor-corrector (ARPC) changes the picture a bit

Low NFE: USDMs are better. High NFE: MDMs are slightly better.

(7/11)

Justin Deschenaux@jdeschena

@caglarml @mariannearr 🔧 The mixture also enables ARPC: AR-Informed Predictor-Corrector sampling

The same model runs an AR pass to score each token, then re-generates the unlikely ones. Unlike speculative decoding, where the acceptance rate sets the NFE, ARPC lets you decide the NFE budget.

(5/11)
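
A rough sketch of the score-then-regenerate loop described in this post, not the paper's implementation. `score_fn` and `regenerate_fn` are hypothetical stand-ins for the model's AR scoring pass and its regeneration step, and `corrector_rounds` and `k` are illustrative knobs. The point it tries to show is that the caller fixes the number of rounds up front, so the cost does not depend on an acceptance rate.

```python
def least_likely_positions(ar_logprobs, k):
    """Return the indices of the k lowest-scoring tokens, in left-to-right order."""
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(range(len(ar_logprobs)), key=lambda i: ar_logprobs[i])
    return sorted(ranked[:k])


def arpc_sketch(tokens, score_fn, regenerate_fn, corrector_rounds, k):
    """Run a fixed number of score-then-regenerate rounds.

    score_fn(tokens) -> per-token log-probabilities (the AR pass).
    regenerate_fn(tokens, positions) -> tokens with those positions resampled.
    Both callables are hypothetical placeholders for calls into the model.
    """
    tokens = list(tokens)
    for _ in range(corrector_rounds):
        logprobs = score_fn(tokens)                      # AR pass scores each token
        positions = least_likely_positions(logprobs, k)  # pick the unlikely ones
        tokens = regenerate_fn(tokens, positions)        # re-generate only those
    return tokens


# Toy stand-ins so the sketch runs without a model (assumptions, not real APIs).
def toy_score(tokens):
    return [-5.0 if t == "?" else -0.1 for t in tokens]


def toy_regenerate(tokens, positions):
    return ["ok" if i in positions else t for i, t in enumerate(tokens)]


result = arpc_sketch(["a", "?", "b", "?"], toy_score, toy_regenerate,
                     corrector_rounds=1, k=2)
print(result)
```

In this toy setup each round makes one scoring call and one regeneration call, so the total number of model calls is set by `corrector_rounds` alone. How the paper actually counts NFE is not stated in the thread.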

Justin Deschenaux@jdeschena

@caglarml Block Diffusion (@mariannearr) generates tokens block-by-block, left to right, which enables KV caching. We extend block diffusion by training over a mixture of block sizes. This improves likelihood and enables hybrid samplers ☯️

(3/11)
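
A minimal sketch of what training over a mixture of block sizes could look like, under stated assumptions: one block size is drawn per training step from fixed mixture weights, and attention is block-causal (bidirectional inside a block, causal across blocks). The function names, the candidate block sizes, and the simplified mask are all illustrative. The thread does not give the actual training recipe, and a real block diffusion setup uses a more involved mask.

```python
import random


def sample_block_size(block_sizes, weights, rng):
    """Draw one block size for this training step from the mixture."""
    return rng.choices(block_sizes, weights=weights, k=1)[0]


def block_causal_mask(seq_len, block_size):
    """mask[i][j] is True when position i may attend to position j.

    A position sees its own block in full plus every earlier block.
    block_size == 1 gives an ordinary causal (AR-style) mask, and
    block_size == seq_len gives full bidirectional attention.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Hypothetical mixture: four block sizes with uniform weights.
block_sizes = [1, 2, 4, 8]
weights = [1.0, 1.0, 1.0, 1.0]
rng = random.Random(0)

draws = [sample_block_size(block_sizes, weights, rng) for _ in range(5)]
mask = block_causal_mask(seq_len=4, block_size=2)
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

Treating block size 1 as the AR mode is an assumption here. Under it, the weight on block size 1 would control what fraction of steps run in AR mode.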

Justin Deschenaux@jdeschena

@caglarml @mariannearr 🧗 Beating AR models is hard.

ARPC helps, and training for just 5% of the steps in AR mode is sufficient. But greedy AR stays the strongest model on GSM8k.

(8/11)