The latent-vs-pixel debate misses the point.
GPT Image 2 shows what users notice: pixel-level fidelity. Latent models show what scales: compact semantic structure.
We connect them by replacing VAE/RAE decoders with a Pixel Diffusion Decoder.
Code and model are available at: https://research.nvidia.com/labs/sil/projects/pid/
🧵(1/N)










