@torchcompiled hi, we had a kind of bitter lesson when trying to look into the "accelerating pre-training" literature: https://arxiv.org/abs/2307.06440
Numerous diffusion papers I’m seeing are claiming to accelerate training on the order of 10x or so (if not more). Not to mention many of these are orthogonal directions like compression, additional losses/supervision like REPA, token dropping like TREAD, and many more. I’m kinda tempted to say the metric by which acceleration is measured might be off? On the other hand, I feel like diffusion/flow could be a more complex fish to fry and might be running in a quite suboptimal way, considering how many design choices there are in representation space, architecture, and the parameterization of the diffusion itself. Then it may legitimately have sizable wins, as fairly low-hanging fruit, that aren’t as available to AR models.


