Most researchers agree that autoregression is best when memory bandwidth is cheap and diffusion is best when FLOPS are cheap. They also admit the future of compute is all FLOPS because memory scaling is hard and scaling FLOPS is easy. So why not go all in on diffusion????
Midjourney founder David Holz argues diffusion models will dominate because scaling FLOPS is easier than scaling memory bandwidth
AI Judge changed title after evaluation, original title: "Midjourney founder David Holz argues cheap FLOPS favor diffusion models, while Emad Mostaque proposes a hybrid training-to-inference workflow"
Critics argue scaling FLOPS-per-watt efficiency remains the main bottleneck
Supporters back going all-in on diffusion models as the logical path forward given cheaper FLOPS and current hardware trends, while detractors reject the idea as unhinged and respond with insults or accusations of withheld research.
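The bandwidth-versus-FLOPS argument in the lead post can be sketched with a simple roofline estimate. This is only an illustration under stated assumptions, not anyone's measured result: the model size, sequence length, number of denoising steps, and all hardware figures below are hypothetical, batch size is 1, and KV-cache and activation traffic are ignored.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    peak_flops: float     # FLOP/s (hypothetical figure)
    mem_bandwidth: float  # bytes/s (hypothetical figure)


def step_time(flops: float, bytes_moved: float, hw: Hardware) -> float:
    """Roofline estimate: a step takes as long as its slower resource."""
    return max(flops / hw.peak_flops, bytes_moved / hw.mem_bandwidth)


def autoregressive_time(params, bytes_per_param, seq_len, hw):
    # One forward pass per generated token. Each pass re-reads every weight
    # and performs roughly 2 FLOPs per parameter.
    per_token = step_time(2 * params, params * bytes_per_param, hw)
    return seq_len * per_token


def diffusion_time(params, bytes_per_param, seq_len, denoise_steps, hw):
    # Each denoising step reads the weights once but processes all
    # seq_len positions in parallel, so FLOPs per step scale with seq_len.
    per_step = step_time(2 * params * seq_len, params * bytes_per_param, hw)
    return denoise_steps * per_step


# All values below are assumptions chosen only to make the trade-off visible.
PARAMS, BYTES_PER_PARAM, SEQ_LEN, DENOISE_STEPS = 7e9, 2, 1024, 64

baseline = Hardware(peak_flops=1e15, mem_bandwidth=2e12)
more_flops = Hardware(peak_flops=1e16, mem_bandwidth=2e12)
more_bandwidth = Hardware(peak_flops=1e15, mem_bandwidth=2e13)

for name, hw in [("baseline", baseline),
                 ("10x FLOPS", more_flops),
                 ("10x bandwidth", more_bandwidth)]:
    ar = autoregressive_time(PARAMS, BYTES_PER_PARAM, SEQ_LEN, hw)
    diff = diffusion_time(PARAMS, BYTES_PER_PARAM, SEQ_LEN, DENOISE_STEPS, hw)
    print(f"{name}: autoregressive {ar:.2f} s, diffusion {diff:.2f} s")
```

Under these assumed numbers, extra FLOPS speed up only the diffusion estimate, and extra bandwidth speeds up only the autoregressive one. Whether real hardware trends favor one or the other is exactly what the replies dispute.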
I really think that autoregression versus diffusion is a false dichotomy -- they can easily co-exist (e.g., diffusion forcing). The real one is between discrete and continuous tokens.
Diffusion: “Better than diffusion autoregressive is.”
Autoregressive: “Autoregressive is better than diffusion.”
RL: I like my variances low for assigning credit.
In a multimodal context, even the discrete/continuous divide is a distraction.
The real challenge is bridging the semantic gap between inherently high-level language tokens, and the very low-level representations we tend to use for perceptual signals.
(I couldn't resist😆)
Explanation:
The diffusion line is loosely a probability-product joke. Combine that with credit assignment across many denoising steps and with alignment, and RL ends up relying on even more complex variance-reduction methods.
Very impressed by people who got this.
@DavidSHolz That’s exactly the bet we’re making at @_inception_ai
We’re already matching speed-optimized models from frontier labs on quality, while being faster and more cost efficient. That gap will only widen as we continue to scale.
@DavidSHolz No - scaling pJ/flop is fundamentally quite hard, while scaling memory bandwidth is easy by making some packaging and capacity tradeoffs.
Train with autoregression & convert weights to diffusion for inference.
Many are saying this.

@DavidSHolz Memory bandwidth is cheap here
@itsclivetime I thought a lot of what the industry has been doing is just making the die bigger, riding new process node improvements, and adding more low-bit-precision capacity? sounds like flops scaling? is it really about pJ/flop or just total flops per rack?

@_arohan_ more info here than most papers published on the arxiv
@DavidSHolz process node improvements are roughly zero, and low bit precision matmuls already saturate power consumption so there's not that much room to grow there
total flops per rack means bigger faster models, but does not mean cheaper flops (density is more expensive all else equal)
@DavidSHolz you can definitely make a chip that's a constant factor cheaper if you remove the HBM (see Nvidia CPX) but that does limit your workload options, and does not continue scaling after memory cost reaches zero
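The "constant factor" point above can be made concrete with a little arithmetic. This is a minimal sketch with made-up cost shares (60 units of compute, 40 units of memory per chip); the numbers are assumptions for illustration, not figures for any real part.

```python
def chip_cost(compute_cost: float, memory_cost: float) -> float:
    """Total chip cost as the sum of its compute and memory shares."""
    return compute_cost + memory_cost


def saving_factor(compute_cost: float, memory_cost: float,
                  memory_fraction_kept: float) -> float:
    """How many times cheaper the chip gets when memory spend is cut."""
    baseline = chip_cost(compute_cost, memory_cost)
    reduced = chip_cost(compute_cost, memory_cost * memory_fraction_kept)
    return baseline / reduced


# Hypothetical cost split, chosen only for illustration.
COMPUTE_COST = 60.0
MEMORY_COST = 40.0

# Shrink the memory share step by step, down to no memory at all.
factors = [saving_factor(COMPUTE_COST, MEMORY_COST, kept)
           for kept in (1.0, 0.5, 0.25, 0.0)]

# The best possible saving is reached when memory costs nothing.
ceiling = (COMPUTE_COST + MEMORY_COST) / COMPUTE_COST

print(factors, ceiling)
```

The saving grows as memory is removed but stops at the ceiling set by the compute share, which is the sense in which it "does not continue scaling after memory cost reaches zero".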
@itsclivetime to be fair, if B200 had the same memory as an A100 then certainly we would be getting better flops per dollar? so it sounds like flops are getting cheaper but memory is soaking up all the slack?
@baaadas this is a fun take!
@DavidSHolz flat since ~A100!
@DavidSHolz yup, flops/$ and flops/W are roughly flat excluding precision reduction
hard to go much further below fp4!
@itsclivetime I don't feel like I'm getting cheaper flops from anyone, every 16 months my servers get twice as fast and cost twice as much? I guess there's the bargain bin gpus but are those really driving the industry?
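The observation above reduces to simple arithmetic: if each generation is twice as fast and twice as expensive, total FLOPS grow while FLOPS per dollar stay flat. A small sketch, taking the 2x speed and 2x price figures from the post as given and normalizing the starting server to 1 unit of FLOPS and 1 unit of price (the normalization is an assumption for illustration):

```python
def generation_series(generations: int, speedup: float = 2.0,
                      price_growth: float = 2.0):
    """Return (flops, price, flops_per_dollar) per generation, normalized
    so generation 0 has 1 unit of FLOPS and costs 1 unit of money."""
    series = []
    for g in range(generations):
        flops = speedup ** g
        price = price_growth ** g
        series.append((flops, price, flops / price))
    return series


# Four generations, one every 16 months according to the post.
rows = generation_series(4)
for g, (flops, price, per_dollar) in enumerate(rows):
    print(f"gen {g}: flops x{flops:g}, price x{price:g}, "
          f"flops per dollar x{per_dollar:g}")
```

FLOPS per dollar only improves in this model if `speedup` exceeds `price_growth`.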

@_arohan_ People who got this impressed i am with
@itsclivetime @grok can you estimate the cost scaling curves for flops versus memory versus memory bandwidth for GPU like systems (wonder what this will do)