Prime Intellect's Will Brown proposes extending the ECHO paper to multimodal models to support Yann LeCun's self-supervised theories

Engineer kalomaze suggested using discrete autoencoders on vision latents.

Original post

will brown@willccbb

easiest way to convince yourself that maybe yann lecun was cooking is to read ECHO from @DimitrisPapail and then ask “cool! can we do this for multimodal models next?”

1:06 PM · Jun 21, 2026 · 7.1K Views

Posts from X

kalomaze@kalomaze

@willccbb @DimitrisPapail have you seen:

kalomaze@kalomaze

it would be funny if you could build some bullshit discrete autoencoder for SSL vision latents and train on that attached to vlm tool call outputs and magical modeling of "what will be the visual state" starts to fall out lmao

will brown@willccbb

what will happen is that we’ll all just converge to omni models where embeddings map cleanly to either patches or tokens, we’ll learn to predict the next embedding by proxy, it’ll work, and we’ll go “ha! yann was wrong, transformers win” without ever reading the JEPA papers

kalomaze@kalomaze

@willccbb @DimitrisPapail if all data is representable through factorized softmax distributions, then in principle, no modality is uniquely privileged by the prediction objective

will brown@willccbb

@kalomaze @DimitrisPapail “distribution over text tokens” is just a modality

will brown@willccbb

@kalomaze @DimitrisPapail why not just omni everything? joint stream for text/audio/image (patches), modality encoders/decoders are single-layer, special tokens for switching

some kind of joint embedding predictive architecture

will brown@willccbb

yeah you want it to basically be "just the embedding"

i mention diffusion as a very thin bolt-on that basically *only* serves the function of enabling stochastic outputs from an embedding with higher-dim lookahead, the thinner the better. dflash for specdec just feels like a nice analogy for decoding token blocks <--> patches. random codebook vectors would functionally be something like the fourier interpretation. fairly unopinionated as to which version of last-mile noise is most effective, just that you probably want one.

kalomaze@kalomaze

@willccbb @DimitrisPapail if you can get shape + scene level ICL from a 256-patch transformer that's predicting a cumulative sum of ~32 vectors from *random* codebooks... then do you even need to do diffusion? what if you just kept layering cheap summative categoricals
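One way to read "a cumulative sum of ~32 vectors from random codebooks" is residual vector quantization (RVQ) with frozen random codebooks. The sketch below is a guess at that setup, not the code from the thread: the function names, the per-level scale decay, and the all-zero row added to each codebook (so a level can leave the residual untouched) are all assumptions made for illustration.

```python
import numpy as np

def make_random_codebooks(num_levels, codebook_size, dim, seed=0):
    """Frozen random codebooks, one per residual level (nothing is learned)."""
    rng = np.random.default_rng(seed)
    codebooks = []
    for level in range(num_levels):
        # Assumption: shrink the scale per level so later levels refine detail.
        cb = rng.standard_normal((codebook_size, dim)) * (0.7 ** level)
        # Assumption: row 0 is a "no-op" zero vector, so the reconstruction
        # error can never grow when a level is added.
        cb[0] = 0.0
        codebooks.append(cb)
    return codebooks

def rvq_encode(x, codebooks):
    """Greedily pick, per level, the codeword nearest to the current residual."""
    residual = np.array(x, dtype=float)
    codes = []
    for cb in codebooks:
        dists = ((cb - residual) ** 2).sum(axis=1)
        idx = int(dists.argmin())
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def partial_reconstructions(codes, codebooks):
    """Row k is the cumulative sum of the first k+1 chosen codewords."""
    chosen = np.stack([cb[i] for cb, i in zip(codebooks, codes)])
    return np.cumsum(chosen, axis=0)

def rvq_decode(codes, codebooks):
    """Full reconstruction: the sum of one codeword per level."""
    return partial_reconstructions(codes, codebooks)[-1]
```

In this reading, a model would predict one categorical per level (32 small softmaxes per patch) and the decoder is just a sum, which is the "cheap summative categoricals" being floated as an alternative to a diffusion head.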

kalomaze@kalomaze

@willccbb @DimitrisPapail ~384dim CLS token embedding of a small DINO or something. tiny RVQ vocabulary you staple onto the lm head + embeddings. ~8 or 16 additional tokens per image before the patches (iirc, qwen3's vlm encoding enforces a minimum of 64 for anyways). fuck it. why not? what even happens?
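A rough sketch of what "stapling" a tiny RVQ vocabulary onto the LM head and embeddings could look like at the token-id level. Only the 8 tokens per image comes from the post; the text vocabulary size of 1000, the codebook size of 256, the hidden size of 16, and all function names are hypothetical placeholders, and the real layout in any given VLM may differ.

```python
import numpy as np

def rvq_codes_to_token_ids(codes, text_vocab_size, codebook_size):
    """Map per-level RVQ codes to ids in a block appended after the text vocab."""
    return [text_vocab_size + level * codebook_size + int(code)
            for level, code in enumerate(codes)]

def token_ids_to_rvq_codes(token_ids, text_vocab_size, codebook_size):
    """Inverse mapping; position k in the sequence is assumed to hold level k."""
    return [int(t) - text_vocab_size - level * codebook_size
            for level, t in enumerate(token_ids)]

def extend_rows(matrix, num_new_rows, seed=0):
    """Append randomly initialised rows, e.g. to an embedding table or LM head."""
    rng = np.random.default_rng(seed)
    new_rows = 0.02 * rng.standard_normal((num_new_rows, matrix.shape[1]))
    return np.concatenate([matrix, new_rows], axis=0)

# Hypothetical sizes, chosen only to keep the example small.
TEXT_VOCAB = 1000
NUM_LEVELS = 8        # "~8 or 16 additional tokens per image"
CODEBOOK_SIZE = 256
HIDDEN = 16

embeddings = extend_rows(np.zeros((TEXT_VOCAB, HIDDEN)),
                         NUM_LEVELS * CODEBOOK_SIZE)

# One code per RVQ level for a single image's CLS embedding (made-up values).
codes = [3, 0, 255, 17, 42, 8, 99, 128]
image_prefix = rvq_codes_to_token_ids(codes, TEXT_VOCAB, CODEBOOK_SIZE)
```

Here `image_prefix` would be placed before the image patches, and the same id block would be added to the LM head so the model can also predict those tokens.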

will brown@willccbb

@kalomaze @DimitrisPapail for omni models they are approximately the same thing

kalomaze@kalomaze

@willccbb @DimitrisPapail a problem imo is people rushing to do 1D categoricals, ala VQVAE. multi-categorical prediction for truly higher dim data in the same patch, where you model residuals conditioned on coarse->fine, is underexplored. tried it once with random vector "codebooks", it seemed to work
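The coarse-to-fine, multi-categorical prediction described here can be written as a chain of softmaxes per patch. The notation below is an assumption introduced for illustration and does not come from the thread: h is the model's output embedding for one patch, c_k is the code at residual level k, E^{(k)} is the level-k codebook, and f_k is a small prediction head.

```latex
% Assumed notation: h = patch embedding, c_k = code at level k,
% E^{(k)} = level-k codebook (possibly random and frozen), f_k = small head.
p(c_1, \dots, c_K \mid h) = \prod_{k=1}^{K} p(c_k \mid h, c_{<k}),
\qquad
p(c_k \mid h, c_{<k}) = \operatorname{softmax}\big(f_k(h, c_{<k})\big)_{c_k}

% Reconstruction is a sum of one codeword per level; training is
% plain cross-entropy summed over levels.
\hat{x} = \sum_{k=1}^{K} E^{(k)}_{c_k},
\qquad
\mathcal{L} = -\sum_{k=1}^{K} \log p(c_k \mid h, c_{<k})
```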

kalomaze@kalomaze

@willccbb @DimitrisPapail hmm... probably one of the things i dont like about current yannesque approaches is how much more powerful predicting in a categorical space that lets you explicitly represent uncertainty appears to be vs. representations that map to fuzzy MSE latents...
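A toy illustration of the contrast drawn here between categorical prediction and MSE regression, assuming a deliberately simple setup: a target that is -1 or +1 with equal probability, a single constant predictor, and three hand-picked bins. The function names and bin centers are hypothetical.

```python
import numpy as np

def best_constant_mse(targets):
    """The constant that minimises mean squared error is the mean."""
    return float(np.mean(targets))

def fit_categorical(targets, bin_centers):
    """Maximum-likelihood categorical over bins: normalised nearest-bin counts."""
    nearest = np.abs(targets[:, None] - bin_centers[None, :]).argmin(axis=1)
    counts = np.bincount(nearest, minlength=len(bin_centers))
    return counts / counts.sum()

# Bimodal target: half the samples are -1, half are +1.
targets = np.array([-1.0, 1.0] * 500)
bin_centers = np.array([-1.0, 0.0, 1.0])

point = best_constant_mse(targets)             # 0.0, a value that never occurs
probs = fit_categorical(targets, bin_centers)  # [0.5, 0.0, 0.5], both modes kept
```

The MSE-optimal point lands between the two modes, while the categorical puts no mass there and keeps both outcomes explicit, which is one concrete sense in which a categorical space "lets you explicitly represent uncertainty".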

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

@willccbb @DimitrisPapail I'm thinking: "cool! can we do this for patient health trajectories next?"

will brown@willccbb

@kalomaze @DimitrisPapail softmax is nice for discrete tokens, but you could also imagine some kind of fourier space noise or mini dFlash-like diffusion in the output heads for injecting last-mile modality-specific randomness without straying too far from the joint embedding

Vivek@vivek_2332

@willccbb @DimitrisPapail big takeaway from ECHO is that world models are basically required to solve the task. and world models + rl is where it gets interesting. you stop building envs and just deploy a world model that knows the env. so for multimodal i think would be possible.
