
Jiatao Gu and Sander Dieleman argue generative AI is shifting from latent diffusion back to raw pixel training

Faster hardware makes two-stage latent pipelines harder to justify

Original post

Sander Dieleman@sedielem

Agreed, I think we've reached a point where people are willing to give up the efficiency gains of latents to make their lives a lot simpler. The two-stage sequential dependency in training LDMs is painful, but we put up with it when the payoff is large enough.

As models get more sophisticated and hardware gets faster, more and more people will start to believe the payoff isn't worth the pain anymore. It seems like it's starting to happen here and there for image models. I think it will take quite a bit longer for video models, though.

Jiatao Gu@thoma_gu

@LiangZheng_06 Latents are about efficiency, but they create additional problems such as information loss and decoder quality degradation. When efficiency becomes less of a problem, people feel this is like a bitter lesson and want to learn everything end-to-end.

4:42 PM · Jun 28, 2026 · 126 Views

Posts from X

Jiatao Gu@thoma_gu

Another critical reason is that people now find it possible to train PixelDiT much more efficiently than before.

Pixel space was never a problem for diffusion. In the early days, however, it required UNets, convolutions, cascading, etc. to make diffusion work well on raw pixels, which seemed much more complicated than LDM.

But now many works, including JiT, show that we can do things much more simply. So the benefits of modeling pixels are showing up again.

So I think algorithmic improvements are also making this happen. But this applies only to diffusion, not to all model types.
