Latent-space models are a cage we’ve boxed ourselves into. The reason for using them in the first place was always efficiency, but we lost the plot and forgot that the speed costs us in terms of progress. It’s time to move on to pixel-space models for the next state of the art.
Peyman Milanfar argues that latent-space models restrict generative AI progress and urges a shift to pixel-space architectures.
Luca Ambrogioni questioned whether pixel-space models scale to high-resolution video.
Supporters back the call to shift AI models from latent space to pixel space because latent space feels overly restrictive, while critics defend latent space for robotics applications or dismiss the idea as nonsensical.

@docmilanfar Yeah, the box is too tight

@docmilanfar Here's a template for when the plot gets lost on what to plot next:

@docmilanfar It would be interesting to see how pixel-space models handle global climate data with resolution over 10k.

@docmilanfar I partially disagree: 1. There is no natural, universal pixel-space distance. And probably there never will be. 2. The problem isn't the latent space itself; it's the failure to consider the predictable and unpredictable parts of images (data). The problem lies with the decoders.

@docmilanfar Why are pixel space models preferable?

@docmilanfar Frozen VAEs may disappear. Compressed representations won’t. Some lower-dimensional latent will probably remain at least until the next major compute/modeling breakthrough

@docmilanfar instead of abandoning latent space, should we focus on better, dynamically scaling latent topologies? I agree in relation to VAEs

@docmilanfar Even for high-res videos?

@docmilanfar Atoms space is pretty cool

@docmilanfar Meanwhile robotics and world model applications are suffering from pixel space issues, latent space is the way to go

@docmilanfar Totally agree!

@docmilanfar That would be the opposite call from the one LeCun is making, right? Or did I miss something?

@docmilanfar Am I missing something? Pixel-space seems like a sparse representation of the world just like a lot of latent spaces

@docmilanfar What? This doesn't make any sense.

@DadMakingGames @docmilanfar it's not the topology that matters, it's the geometry you

@docmilanfar Lol

@subirv @docmilanfar they are not. latent could have many forms, perhaps OP is talking about a very specific one that they claim is hard to train/scale.

@docmilanfar In my opinion, decoders should decode using generative priors. We should revisit VAEs, but with the tools achieved in the field of conditional generation.

@liuzhisong_cv @docmilanfar Maybe we don't need to train on the whole domain, only the "local" physics, and then deploy multiple agents to solve the total domain. In the end, the physics are equivariant in space and time.

@docmilanfar Agree on the stagnation. Latent-space traded interpretability for efficiency. Pixel-space could prioritize feature predictability over realism, but compute remains the bottleneck.