Midjourney founder David Holz argues diffusion models will dominate because scaling FLOPS is easier than scaling memory bandwidth

Views: 48.3K · Bookmarks: 103 · Likes: 309 · Retweets: 13 · Replies: 14

Jiaming Song@baaadas

I really think that autoregression and diffusion is a false dichotomy -- they can easily co-exist (e.g., diffusion forcing). The real one is between discrete and continuous tokens.

David@DavidSHolz

Most researchers agree that autoregression is best when memory bandwidth is cheap and diffusion is best when FLOPS are cheap. They also admit the future of compute is all FLOPS because memory scaling is hard and scaling FLOPS is easy. So why not go all in on diffusion????
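
One common reading of this claim can be made concrete with a roofline-style estimate. The following is a minimal sketch, not something from the thread: the function name, the hardware figures, and the model size are all hypothetical, and it assumes each weight is read from memory once per forward pass and contributes roughly 2 FLOPs per token processed in that pass.

```python
def step_bound(params, tokens_per_pass, peak_flops, mem_bw, bytes_per_param=2):
    """Classify one forward pass as memory- or compute-bound (roofline-style).

    All inputs are illustrative assumptions, not measured values.
    """
    flops = 2 * params * tokens_per_pass    # ~2 FLOPs per weight per token
    bytes_moved = params * bytes_per_param  # each weight read once per pass
    t_compute = flops / peak_flops          # seconds if limited by arithmetic
    t_memory = bytes_moved / mem_bw         # seconds if limited by bandwidth
    bound = "compute" if t_compute > t_memory else "memory"
    return bound, t_compute, t_memory


# Hypothetical accelerator and model, chosen only for illustration.
PEAK_FLOPS = 1e15  # FLOP/s
MEM_BW = 3e12      # bytes/s
PARAMS = 7e9       # weights

# Autoregressive decoding at batch size 1: one new token per pass.
ar = step_bound(PARAMS, 1, PEAK_FLOPS, MEM_BW)

# A denoising pass that updates 1024 positions at once.
diff = step_bound(PARAMS, 1024, PEAK_FLOPS, MEM_BW)

print("autoregressive:", ar[0], "| diffusion-style:", diff[0])
```

Under these assumptions the single-token pass spends most of its time streaming weights, while the many-position pass spends most of its time on arithmetic. Batching many autoregressive requests together shifts the first case toward compute-bound as well, so the sketch says nothing about which approach wins in a real deployment.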

rohan anil@_arohan_

Diffusion: “Better than diffusion autoregressive is.”

Autoregressive: “Autoregressive is better than diffusion.”

RL: I like my variances low for assigning credit.

Sander Dieleman@sedielem

In a multimodal context, even the discrete/continuous divide is a distraction.

The real challenge is bridging the semantic gap between inherently high-level language tokens, and the very low-level representations we tend to use for perceptual signals.

(I couldn't resist😆)

Jiaming Song@baaadas

I really think that autoregression and diffusion is a false dichotomy -- they can easily co-exist (e.g., diffusion forcing). The real one is between discrete and continuous tokens.

rohan anil@_arohan_

Explanation:

Diffusion is loosely a probability product joke, combined with credit assignment across many denoising steps and alignment, rl requires relying on even more complex variance reduction methods.

rohan anil@_arohan_

Very impressed by people who got this.

Stefano Ermon@StefanoErmon

@DavidSHolz That’s exactly the bet we’re making at @_inception_ai

We’re already matching speed-optimized models from frontier labs on quality, while being faster and more cost efficient. That gap will only widen as we continue to scale.

Clive Chan@itsclivetime

@DavidSHolz No - scaling pJ/flop is fundamentally quite hard, while scaling memory bandwidth is easy by making some packaging and capacity tradeoffs.

Emad@EMostaque

Train with autoregression & convert weights to diffusion for inference.

Beff (e/acc)@beffjezos

Many are saying this.

Cerebras@cerebras

@DavidSHolz Memory bandwidth is cheap here

David@DavidSHolz

@itsclivetime I thought a lot of what the industry has been doing is just making the die bigger and riding new process node improvements and adding more low bit precision capacity? sounds like flops scaling? is it really about pj/flop or just total flops per rack?

Andrew Carr 🤸@andrew_n_carr

@_arohan_ more info here than most papers published on the arxiv

Clive Chan@itsclivetime

@DavidSHolz process node improvements are roughly zero, and low bit precision matmuls already saturate power consumption so there's not that much room to grow there

total flops per rack means bigger faster models, but does not mean cheaper flops (density is more expensive all else equal)

Clive Chan@itsclivetime

@DavidSHolz you can definitely make a chip that's a constant factor cheaper if you remove the HBM (see Nvidia CPX) but that does limit your workload options, and does not continue scaling after memory cost reaches zero

David@DavidSHolz

@itsclivetime to be fair, if b200 had the same memory as a a100 then certainly we would be getting better flops per dollar? so it sounds like flops are getting cheaper but memory is soaking up all the slack?

David@DavidSHolz

@baaadas this is a fun take!

Clive Chan@itsclivetime

@DavidSHolz flat since ~A100!

Clive Chan@itsclivetime

@DavidSHolz yup, flops/$ and flops/W are roughly flat excluding precision reduction

hard to go much further below fp4!

David@DavidSHolz

@itsclivetime I don't feel like I'm getting cheaper flops from anyone, every 16 months my servers get twice as fast and cost twice as much? I guess there's the bargain bin gpus but are those really driving the industry?
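
The arithmetic behind this observation is simple enough to sketch. The function below is purely illustrative (its name and the baseline figures are hypothetical): if throughput and price both double every 16 months, FLOPS per dollar does not move at all.

```python
def flops_per_dollar(base_flops, base_cost, months,
                     period=16, speedup=2.0, cost_growth=2.0):
    """FLOPS per dollar after `months`, if speed and price each scale
    by a fixed factor every `period` months (illustrative model only)."""
    generations = months / period
    flops = base_flops * speedup ** generations
    cost = base_cost * cost_growth ** generations
    return flops / cost


# Hypothetical baseline server: 1e15 FLOP/s for 30,000 dollars.
baseline = flops_per_dollar(1e15, 3e4, 0)
after_four_years = flops_per_dollar(1e15, 3e4, 48)

# Three generations later the server is 8x faster and 8x more expensive.
print(baseline, after_four_years)
```

Only when `cost_growth` is smaller than `speedup` does the ratio improve, which is the case the two sides of this exchange disagree about.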

Core Automation@CoreAutoAI

@_arohan_ People who got this impressed i am with

David@DavidSHolz

@itsclivetime @grok can you estimate the cost scaling curves for flops versus memory versus memory bandwidth for GPU like systems (wonder what this will do)
