New Routing Method Speeds Diffusion Transformer Training 8.75x

Image diffusion Transformers train slowly in part because their layers pass information through a fixed residual connection inherited from language Transformers. Changing how layers share information makes training much faster: the paper reports the same image quality with 8.75x fewer training iterations.

The surprise is not that Diffusion Transformers had an inefficiency, but where it was hiding. Researchers have spent years refining attention, conditioning, tokenization, objectives, and autoencoders, while leaving the residual stream mostly untouched because it looked like plumbing rather than intelligence.

In a standard residual stack, every layer adds its output to the running stream. The authors found 3 signs that this setup hurts the model: activations grow too large going forward, gradients fade going backward, and neighboring blocks often produce almost the same features. That is bad for any Transformer, but it is especially awkward for diffusion, because denoising is not one fixed task repeated at every step.

Their fix is Diffusion-Adaptive Routing, a replacement for the residual connection that lets each layer choose which earlier layer outputs to use, with the choice changing with the denoising timestep.

The big deal is that the paper does not add a new image dataset, loss, tokenizer, or attention trick. Instead it questions the residual connection that most models kept copying from language Transformers.

----

Link: arxiv.org/abs/2605.20708

Title: "Rethinking Cross-Layer Information Routing in Diffusion Transformers"
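
To make the routing idea in the post concrete, here is a toy sketch in plain Python. It is not the paper's implementation: the softmax gating, the linear dependence on the timestep `t`, and every name (`toy_block`, `routing_weights`, `routed_stack`, `make_params`) are assumptions chosen for illustration. It contrasts a standard residual stack, where each block adds to one running stream, with a routed stack, where each block reads a timestep-dependent weighted mix of all earlier outputs.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def toy_block(v, seed):
    # Hypothetical stand-in for a Transformer block: a fixed pseudo-random
    # vector plus a small dependence on the input.
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) + 0.1 * x for x in v]


def residual_stack(x, depth):
    # Standard residual connection: every block adds to one running stream.
    h = list(x)
    for layer in range(depth):
        out = toy_block(h, seed=layer)
        h = [a + b for a, b in zip(h, out)]
    return h


def routing_weights(base_logits, time_slopes, t):
    # Assumed gating: logits vary linearly with the denoising timestep t,
    # then a softmax turns them into weights over earlier outputs.
    return softmax([a + t * b for a, b in zip(base_logits, time_slopes)])


def mix(sources, weights):
    dim = len(sources[0])
    return [sum(w * s[d] for w, s in zip(weights, sources)) for d in range(dim)]


def make_params(depth, seed=0):
    # One (base_logits, time_slopes) pair per block, plus one for the output.
    rng = random.Random(seed)
    return [
        ([rng.uniform(-1, 1) for _ in range(n)], [rng.uniform(-1, 1) for _ in range(n)])
        for n in range(1, depth + 2)
    ]


def routed_stack(x, depth, t, params):
    # Each block reads a timestep-dependent mix of all earlier outputs
    # instead of a single accumulated stream.
    sources = [list(x)]
    for layer in range(depth):
        base, slopes = params[layer]
        layer_in = mix(sources, routing_weights(base, slopes, t))
        sources.append(toy_block(layer_in, seed=layer))
    base, slopes = params[depth]
    return mix(sources, routing_weights(base, slopes, t)), sources


x = [0.5, -0.25, 1.0, 0.0]
params = make_params(depth=6)
early_out, early_sources = routed_stack(x, 6, t=0.1, params=params)
late_out, late_sources = routed_stack(x, 6, t=0.9, params=params)
print("residual norm:", norm(residual_stack(x, 6)))
print("routed norm (t=0.1):", norm(early_out))
print("routed norm (t=0.9):", norm(late_out))
```

Because the weights in this sketch form a convex combination, the routed output can never have a larger norm than the largest of the outputs it mixes. That is one simple way to keep forward magnitudes in check, though the paper's actual mechanism may differ.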

4:03 AM · May 28, 2026