🔥 New paper: BlockGen: Flexible Blockwise Sequence Modeling with Hybrid Samplers
Are uniform-state diffusion models (USDMs) always stronger than masked diffusion models (MDMs)? Recent work suggests so. However, a few questions remain open 🤔
w/ @caglarml
(1/11)

@caglarml 🔁 For example, unlike MDMs, USDMs can self-correct earlier mistakes. However, prior work has not compared USDMs against MDMs paired with strong remasking correctors, which let masked models revise mistakes too 🤔
(2/11)

@caglarml @mariannearr ❓ Q1: do USDMs still beat MDMs when generating block-by-block?
Under ancestral sampling, yes. Uniform wins, with the biggest gap at low NFE.
(6/11)

@caglarml @mariannearr 📈 Training over a mixture of block sizes
On OpenWebText, with uniform mixture weights, BlockGen reaches 17.5 PPL (masked), slightly above AR (16.7) and well below the best fixed-block model (21.6). Even with only 5% of steps in AR mode, it still reaches 19.1.
(4/11)

@caglarml @mariannearr 🔀 Using a strong predictor-corrector (ARPC) changes the picture a bit
Low NFE: USDMs are better. High NFE: MDMs are slightly better.
(7/11)

@caglarml @mariannearr 🔧 The mixture also enables ARPC: AR-Informed Predictor-Corrector sampling
The same model runs an AR pass to score each token, then re-generates the unlikely ones. Unlike speculative decoding, where the acceptance rate sets the NFE, ARPC lets you decide the NFE budget.
(5/11)
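
A rough sketch of the score-then-regenerate idea described above, not the paper's implementation. The names `arpc_step`, `ar_logprob`, `regenerate`, and `budget` are hypothetical, and the toy scorer and regenerator below merely stand in for the real model's AR pass and resampling step. The fixed `budget` is one assumed way of letting the caller, rather than an acceptance rate, decide how much work each corrector step does.

```python
def arpc_step(tokens, ar_logprob, regenerate, budget):
    """One corrector step: score every token, then redo the least likely ones.

    ar_logprob(tokens) -> one log-probability per position (hypothetical AR pass).
    regenerate(tokens, positions) -> new token list (hypothetical resampler).
    budget -> how many positions to regenerate in this step (assumed knob).
    """
    scores = ar_logprob(tokens)
    # Pick the `budget` positions with the lowest AR log-probability.
    worst = sorted(range(len(tokens)), key=lambda i: scores[i])[:budget]
    positions = sorted(worst)
    return regenerate(tokens, positions), positions


# Toy stand-ins: the "model" believes each token should equal previous + 1.
def toy_ar_logprob(tokens):
    scores = [0.0]  # the first token has no left context in this toy
    for prev, cur in zip(tokens, tokens[1:]):
        scores.append(0.0 if cur == prev + 1 else -5.0)
    return scores


def toy_regenerate(tokens, positions):
    out = list(tokens)
    for i in positions:  # left to right, so later fixes see earlier ones
        out[i] = out[i - 1] + 1 if i > 0 else out[i]
    return out


fixed, redone = arpc_step([0, 1, 7, 3, 4], toy_ar_logprob, toy_regenerate, budget=2)
print(fixed, redone)
```

In this toy run, positions 2 and 3 get the lowest scores, so they are the two that are regenerated.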

@caglarml Block Diffusion (@mariannearr) generates tokens block-by-block, left to right, which enables KV caching. We extend block diffusion by training over a mixture of block sizes. This improves likelihood and enables hybrid samplers ☯️
(3/11)
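
To illustrate why one model can cover several block sizes, here is a simplified block-causal attention mask, offered as a sketch only. It assumes that "AR mode" corresponds to block size 1, and it ignores the extra machinery a real block diffusion training setup needs. `block_causal_mask`, `sample_block_size`, and the example sizes and weights are hypothetical.

```python
import random


def block_causal_mask(seq_len, block_size):
    """mask[i][j] is True when position i may attend to position j.

    Tokens see everything in their own block and in all earlier blocks.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def sample_block_size(sizes, weights, rng):
    """Draw one block size per training step from a mixture (assumed scheme)."""
    return rng.choices(sizes, weights=weights, k=1)[0]


# Block size 1 reduces to the usual causal (AR) mask.
for row in block_causal_mask(4, 1):
    print(["x" if allowed else "." for allowed in row])

# Block size 2: pairs of tokens attend to each other and to earlier pairs.
for row in block_causal_mask(4, 2):
    print(["x" if allowed else "." for allowed in row])

rng = random.Random(0)
size = sample_block_size([1, 4, 16], [1, 1, 1], rng)  # uniform weights, made-up sizes
```

With block size equal to the sequence length the mask is all `True`, so the same function spans the range from fully autoregressive to fully bidirectional.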

@caglarml @mariannearr 🧗 Beating AR models is hard.
ARPC helps, and training for just 5% of the steps in AR mode is sufficient. But greedy AR stays the strongest model on GSM8k.
(8/11)