1/ When a diffusion model generates an image from text, how does each noisy token know what it should become before the image contains any objects?
In our new work, we found that Diffusion Transformers solve spatial-relation prompts using a circuit motif reminiscent of developmental biology: morphogen-like spatial gradients.
At the start of sampling, image tokens are mostly uninformed noise — like an undifferentiated sheet in an embryo. Relation heads then write smooth spatial gradients onto the image canvas, guiding where objects should emerge.
Accepted as a @CVPR 2026 Highlight🌟: http://animadversio.github.io/DiT-Relation-Circuits Beautiful collaboration with my friends and colleagues @fjxdaisy & Xu Pan! A 🧵