13h ago

Leonardo.AI co-founder Ethan Smith warns that 10x training speedup claims in recent diffusion papers may rely on flawed metrics

Prior studies confirm similar speedup overestimations in Transformer pre-training

Original post

Numerous diffusion papers I’m seeing are citing training accelerations on the order of 10x or so (if not more). Not to mention that many of these are orthogonal directions, like compression, additional losses/supervision like REPA, token dropping like TREAD, and many more. I’m kinda tempted to say the metric by which acceleration is measured might be off? On the other hand, I feel like diffusion/flow could be a more complex fish to fry and legitimately has sizable wins, as fairly low-hanging fruit, that aren’t as available to AR models

12:29 AM · May 25, 2026

#713 Pasquale Minervini @PMINERVINI

@torchcompiled Hi, we learned a kind of bitter lesson when we looked into the "accelerating pre-training" literature: https://arxiv.org/abs/2307.06440

12:53 PM · May 25, 2026 · 179 Views

POST

#1884 Ethan @torchcompiled

Numerous diffusion papers I’m seeing are citing training accelerations on the order of 10x or so (if not more). Not to mention that many of these are orthogonal directions, like compression, additional losses/supervision like REPA, token dropping like TREAD, and many more. I’m kinda tempted to say the metric by which acceleration is measured might be off? On the other hand, I feel like diffusion/flow could be a more complex fish to fry and might be running in a quite suboptimal way, considering how many design choices there are in representation space, architecture, and the parameterization of the diffusion itself. Then it may legitimately have sizable wins, as fairly low-hanging fruit, that aren’t as available to AR models

7:32 AM · May 25, 2026 · 3.5K Views

#1884 Ethan @torchcompiled

This is not saying that diffusion will necessarily surpass AR in acceleration, but rather that LLM improvements have much smaller deltas in gain than diffusion papers, and this might reflect that the current state of diffusion is clunky

7:33 AM · May 25, 2026 · 759 Views
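
The thread does not say which metric the papers use, so the following is only a sketch of one way a headline speedup could overstate the gain. It assumes speedup is reported as the ratio of training steps needed to reach a target score (lower is better, as with FID), and compares that with a ratio that also accounts for per-step cost. The function names, the example curve, and every number are hypothetical and are not taken from any paper mentioned above.

```python
from typing import Optional, Sequence, Tuple


def steps_to_target(
    steps: Sequence[int], scores: Sequence[float], target: float
) -> Optional[int]:
    """Return the first logged step whose score is <= target, or None.

    Assumes a lower-is-better metric (hypothetically, something like FID).
    """
    for step, score in zip(steps, scores):
        if score <= target:
            return step
    return None


def speedups(
    base_steps: int,
    method_steps: int,
    base_cost_per_step: float,
    method_cost_per_step: float,
) -> Tuple[float, float]:
    """Return (step-based speedup, compute-adjusted speedup).

    The step-based ratio ignores per-step cost. The compute-adjusted ratio
    multiplies each step count by its relative cost (e.g. FLOPs or seconds).
    """
    step_speedup = base_steps / method_steps
    compute_speedup = (base_steps * base_cost_per_step) / (
        method_steps * method_cost_per_step
    )
    return step_speedup, compute_speedup


# Hypothetical numbers: the baseline hits the target at 400k steps, the new
# method at 40k steps, but each of its steps costs 2.5x as much.
step_ratio, compute_ratio = speedups(400_000, 40_000, 1.0, 2.5)
print(step_ratio, compute_ratio)  # 10.0 4.0
```

Under these made-up numbers, a "10x" claim counted in steps shrinks to 4x once per-step cost is included. The choice of target also matters: a threshold the baseline only reaches late in training inflates the step ratio.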