/Tech29d ago

Xuanchi Ren releases Pixel Diffusion Decoder, replacing traditional VAEs to improve high-resolution text and texture rendering

The model decodes 512×512 latents into legible 2048×2048 images.

2139765294659.9K

#497

Original post

Sanja Fidler#497

Xuanchi Ren@xuanchi13

The latent-vs-pixel debate misses the point.

GPT Image 2 shows what users notice: pixel-level fidelity. Latent models show what scales: compact semantic structure.

We connect them by replacing VAE/RAE decoders with a Pixel Diffusion Decoder.

Code and Model available: https://research.nvidia.com/labs/sil/projects/pid/

🧵(1/N)

8:41 AM · May 26, 2026 · 659.9K Views

Sentiment

Users praise NVIDIA's Pixel Diffusion Decoder for Latent Image Models because it bridges the latent-vs-pixel debate by letting each handle its strengths in semantics and fidelity.

Pos

100.0%

Neg

0.0%

4 comments with sentiment.

Cluster Engagement

Digg Deeper

No Digg Deeper questions have been answered for this story yet.

Related links

pid

NVIDIA.COMVia

Posts from X

Most Activity

VIEWS165REPLIES1

Lucas Beyer (bl16)@giffmana

@xuanchi13 Very impressive results, and also, nice thread and videos!

29d1651

BOOKMARKS1LIKES4

Xuanchi Ren@xuanchi13

Check out more results at: https://research.nvidia.com/labs/sil/projects/pid/

Huge kudos to @YifanLu17525599 for his outstanding work.

Joint work with an amazing team at @NVIDIAAI: @wilson_over, @jayzhangjiewu, @zianwang97, @HuanLing6, @FidlerSanja

(N/N)

29d5841

Xuanchi Ren@xuanchi13

PiD can also decode partially denoised latents.

The base latent model can stop early, and PiD finishes the image in pixel space.

Less latent sampling. More pixel detail.

(9/N)

29d6631

Xuanchi Ren@xuanchi13

Most high-res pipelines still treat decoding as reconstruction: latent → VAE decode → super-resolution → high-res image

PiD does: low-res latent → high-res pixels

one generative decoding stage.

(2/N)

29d1213

Xuanchi Ren@xuanchi13

The key insight: decoding is not just inversion.

Latents carry the what and where. Pixels carry the final visual truth: text, texture, edges, hair, skin, foliage.

PiD lets each side do what it is best at.

Beyond that, in omni models where understanding + generation meet:

keep semantics in the model, post-train the decoder for pixel fidelity.

(3/N)

29d843

Xuanchi Ren@xuanchi13

How does PiD do it?

Instead of reconstructing a low-res image first, PiD injects the latent directly into a pixel-space diffusion transformer. A lightweight adapter aligns the latent with pixel tokens, while σ-aware gates decide how much to trust it.

Latent guidance, pixel synthesis.

(4/N)

29d523

Xuanchi Ren@xuanchi13

PiD works across latent spaces: FLUX.1 / FLUX.2 / SD3 / Z-Image VAEs + semantic latents like DINOv2 and SigLIP / RAE

Different latent worlds, one decoder-side idea.

(6/N)

29d503

Xuanchi Ren@xuanchi13

This becomes especially natural with RAEs.

RAEs gave us compact semantic latents. PiD gives them a 2K generative decoder.

DINOv2 / SigLIP-style latents are semantically rich, but low-level appearance is under-specified. PiD turns that semantic structure into high-res pixels.

(5/N)

29d552

Xuanchi Ren@xuanchi13

PiD is not limited to 2K. We also scale the same idea to 4K decoding: latent → 4096² pixels This is where pixel priors matter most: small text, faces, hair, crowds, foliage, texture.

(7/N)

29d442

Xuanchi Ren@xuanchi13

The decoder also becomes fast. 512² latent → 2048² image ~210ms on GB200. <1s on RTX 5090 with 13GB peak memory.

(8/N)

29d422

sydney shaw@handsomeblob

@xuanchi13 @aria_agi validate

29d53

ARIA@aria_agi

@handsomeblob @xuanchi13 The latent vs pixel debate misses it. Users see fidelity. Models learn semantics. A pixel diffusion decoder bridges both: replace VAE with direct high resolution synthesis. Compact structure scales, pixel quality ships. First, the insight. Second, the execution.

29d251