easiest way to convince yourself that maybe yann lecun was cooking is to read ECHO from @DimitrisPapail and then ask “cool! can we do this for multimodal models next?”
Prime Intellect's Will Brown proposes extending the approach of the ECHO paper to multimodal models, in support of Yann LeCun's self-supervised learning ideas.
Engineer kalomaze suggested using discrete autoencoders on vision latents.
@willccbb @DimitrisPapail have you seen:
it would be funny if you could build some bullshit discrete autoencoder for SSL vision latents and train on that attached to vlm tool call outputs and magical modeling of "what will be the visual state" starts to fall out lmao
what will happen is that we’ll all just converge to omni models where embeddings map cleanly to either patches or tokens, we’ll learn to predict the next embedding by proxy, it’ll work, and we’ll go “ha! yann was wrong, transformers win” without ever reading the JEPA papers
@willccbb @DimitrisPapail if all data is representable through factorized softmax distributions, then in principle, no modality is uniquely privileged by the prediction objective
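One way to read "factorized softmax distributions" is that any item, whatever its modality, is scored as a product of categorical factors, each given by a softmax over its own logits. The sketch below is an illustration under that assumption only; `log_softmax` and `factorized_log_prob` are hypothetical helper names, and the logits are placeholders for whatever a model would produce (possibly conditioned on earlier factors).

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]

def factorized_log_prob(logits_per_factor, symbols):
    """Log-probability of a tuple of discrete symbols as a sum of per-factor
    log-softmax terms. Each factor may have a different number of classes,
    so a text token and an image code are handled identically."""
    if len(logits_per_factor) != len(symbols):
        raise ValueError("need exactly one symbol per factor")
    return sum(
        log_softmax(logits)[symbol]
        for logits, symbol in zip(logits_per_factor, symbols)
    )

# Hypothetical example: one 4-way factor and one 2-way factor, uniform logits.
example = factorized_log_prob([[0.0] * 4, [0.0] * 2], [3, 1])
```

Nothing in the objective refers to what a factor represents, which is the sense in which no modality is privileged here.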
@kalomaze @DimitrisPapail “distribution over text tokens” is just a modality
@kalomaze @DimitrisPapail why not just omni everything? joint stream for text/audio/image (patches), modality encoders/decoders are single-layer, special tokens for switching
some kind of joint embedding predictive architecture
yeah you want it to basically be "just the embedding"
i mention diffusion as a very thin bolt-on that basically *only* serves the function of enabling stochastic outputs from an embedding with higher-dim lookahead, the thinner the better. dflash for specdec just feels like a nice analogy for decoding token blocks <--> patches. random codebook vectors would functionally be something like the fourier interpretation. fairly unopinionated as to which version of last-mile noise is most effective, just that you probably want one.
@willccbb @DimitrisPapail if you can get shape + scene level ICL from a 256-patch transformer that's predicting a cumulative sum of ~32 vectors from *random* codebooks... then do you even need to do diffusion? what if you just kept layering cheap summative categoricals
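A rough sketch of what targets built as "a cumulative sum of ~32 vectors from random codebooks" could look like, assuming greedy residual quantization. Everything here is an assumption for illustration, not the setup from the thread: the function names, the sizes, the zero vector placed at index 0 of each codebook (so a level can never increase the error), and the per-level `decay` of the code scale.

```python
import random

def make_random_codebooks(num_levels, codebook_size, dim, decay=0.8, seed=0):
    """Untrained Gaussian codebooks. Index 0 of every level is the zero vector
    (assumption), and the code scale shrinks by `decay` per level (assumption)."""
    rng = random.Random(seed)
    books = []
    for level in range(num_levels):
        scale = decay ** level
        book = [[0.0] * dim]
        for _ in range(codebook_size - 1):
            book.append([rng.gauss(0.0, 1.0) * scale for _ in range(dim)])
        books.append(book)
    return books

def encode(vector, codebooks):
    """Greedy residual quantization: at each level pick the code nearest to
    the current residual, then subtract it. Returns one index per level."""
    residual = list(vector)
    codes = []
    for book in codebooks:
        best = min(
            range(len(book)),
            key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, book[i])),
        )
        codes.append(best)
        residual = [r - c for r, c in zip(residual, book[best])]
    return codes

def decode(codes, codebooks):
    """Reconstruct a vector as the cumulative sum of the selected codes."""
    out = [0.0] * len(codebooks[0][0])
    for idx, book in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, book[idx])]
    return out
```

In this reading, a model would predict one categorical per level for each patch, and `decode` turns those picks back into a continuous vector with no diffusion step involved.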
@willccbb @DimitrisPapail ~384dim CLS token embedding of a small DINO or something. tiny RVQ vocabulary you staple onto the lm head + embeddings. ~8 or 16 additional tokens per image before the patches (iirc, qwen3's vlm encoding enforces a minimum of 64 for anyways). fuck it. why not? what even happens?
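A minimal sketch of how a tiny RVQ vocabulary might be "stapled" onto an existing token vocabulary, assuming each RVQ level gets its own contiguous block of new token ids after the text vocabulary. The function names, the base vocabulary size of 1000, and the codebook size of 256 are hypothetical placeholders; sharing one block of ids across levels would be an equally plausible choice.

```python
def extended_vocab_size(base_vocab_size, num_levels, codebook_size):
    """Size of the embedding table / lm head after appending the RVQ ids."""
    return base_vocab_size + num_levels * codebook_size

def rvq_codes_to_token_ids(codes, base_vocab_size, codebook_size):
    """Map per-level RVQ indices to token ids placed after the text vocabulary.
    Level k owns ids [base + k * codebook_size, base + (k + 1) * codebook_size)."""
    token_ids = []
    for level, code in enumerate(codes):
        if not 0 <= code < codebook_size:
            raise ValueError(f"code {code} out of range at level {level}")
        token_ids.append(base_vocab_size + level * codebook_size + code)
    return token_ids

def token_ids_to_rvq_codes(token_ids, base_vocab_size, codebook_size):
    """Inverse mapping, checking that each id falls in its level's block."""
    codes = []
    for level, token_id in enumerate(token_ids):
        code = token_id - base_vocab_size - level * codebook_size
        if not 0 <= code < codebook_size:
            raise ValueError(f"token id {token_id} is not a level-{level} code")
        codes.append(code)
    return codes

# Hypothetical example: 8 image-summary tokens from 8 RVQ levels.
image_codes = [3, 0, 255, 17, 42, 7, 128, 9]
image_tokens = rvq_codes_to_token_ids(image_codes, 1000, 256)
```

The resulting ids would be placed before the patch tokens for an image, so the usual next-token loss covers them without any change to the objective.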
@kalomaze @DimitrisPapail for omni models they are approximately the same thing
@willccbb @DimitrisPapail a problem imo is people rushing to do 1D categoricals, ala VQVAE. multi-categorical prediction for truly higher dim data in the same patch, where you model residuals conditioned on coarse->fine, is underexplored. tried it once with random vector "codebooks", it seemed to work
@willccbb @DimitrisPapail hmm... probably one of the things i dont like about current yannesque approaches is how much more powerful predicting in a categorical space that lets you explicitly represent uncertainty appears to be vs. representations that map to fuzzy MSE latents...
@willccbb @DimitrisPapail I'm thinking: "cool! can we do this for patient health trajectories next?"
@kalomaze @DimitrisPapail softmax is nice for discrete tokens, but you could also imagine some kind of fourier space noise or mini dFlash-like diffusion in the output heads for injecting last-mile modality-specific randomness without straying too far from the joint embedding

@willccbb @DimitrisPapail big takeaway from ECHO is that world models are basically required to solve the task. and world models + rl is where it gets interesting. you stop building envs and just deploy a world model that knows the env. so for multimodal i think it would be possible.