Diffusion Papers Claim 10x Training Speedups With Orthogonal Techniques
A lot of the diffusion papers I’m seeing claim training speedups on the order of 10x (if not more). On top of that, many of them go in orthogonal directions: compression, additional losses/supervision like REPA, token dropping like TREAD, and many more. I’m kinda tempted to say the metric by which acceleration is measured might be off? On the other hand, I feel like diffusion/flow could be a more complex fish to fry and might currently be running in a quite suboptimal way, considering how many design choices there are in the representation space, the architecture, and the parameterization of the diffusion itself. If so, it may legitimately have sizable wins available as fairly low-hanging fruit that aren’t as available to AR models.
This is not saying that diffusion will necessarily surpass AR in acceleration, but more that LLM improvements have much smaller deltas in gain than diffusion papers, and this might reflect that the current state of diffusion is clunky.