Gavin Leech, UK-based AI researcher and co-founder of the consultancy Arb, notes vision models are roughly 1000 times smaller than text models owing to language's data compression through compositional semantics and abstractions
Replies linked the gap to potential image-based chain-of-thought reasoning.
Users are excited about language models' superior data compression over vision models because text conveys far deeper meaning relative to its size.
Most Activity

@gleech Can you expand on "language's god-tier data compression"? The amount of the human cortex devoted to vision is much larger than language and declarative memory. Suggests LLMs might not be in their final form.

@TobyLightheart my understanding is that
1) functional specialisation is true but overstated, e.g. the cerebellum isn't a "motor area", it lights up for everything
2) a lot of the visual demand comes from the need for real-time vision processing. Relax that and you wouldn't need 30% of cortex

@ArmanMaesumi some humans do video generation, and their training process is also brutal
https://youtu.be/5osZk9Mw94w?si=lbyezBofYVU3IJM0

@gleech For image/video, training is far more expensive per parameter because of activation size. The same goes for the size of each piece of data, especially video, which requires decoding

@TobyLightheart kōan: what determines the resolution we call "real-time"?

@gleech A picture is worth a thousand words.

@gleech Note: those image models don't understand, and cannot replicate, any intelligible text.

@ArmanMaesumi Yeah brutal memory cost, and generation is a different beast than the classification stuff I was thinking of.
On flops, I think it's vision classification << LLM generation << video diffusion

@TobyLightheart @gleech AI vision models use a small number of parameters and repeat those parameters over every pixel. Human vision also has similar repetitions, but we count every neuron separately. If we scored vision models like the visual cortex, they would be vastly larger.
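The weight-sharing point in the reply above can be made concrete with a rough count. This is only a sketch under stated assumptions: a single 3x3 convolution layer with 3 input and 64 output channels applied to a 224x224 image, all of which are illustrative numbers rather than figures from the thread. It compares the weights a convolution actually stores with the number it would need if every pixel position had its own private copy, which is closer to how neurons are counted in cortex.

```python
def shared_weights(in_channels, out_channels, kernel_size):
    """Weights stored by one 2D convolution layer (biases ignored).

    The same kernel is reused at every pixel, so the count does not
    depend on the image size.
    """
    return out_channels * in_channels * kernel_size * kernel_size


def unshared_weights(in_channels, out_channels, kernel_size, height, width):
    """Weights needed if each pixel position had its own copy of the kernel.

    Assumes stride 1 and 'same' padding, so there is one kernel
    application per pixel of the input.
    """
    per_position = shared_weights(in_channels, out_channels, kernel_size)
    return per_position * height * width


# Illustrative (assumed) layer and image sizes.
stored = shared_weights(3, 64, 3)
counted_per_position = unshared_weights(3, 64, 3, 224, 224)

print("stored weights:", stored)
print("weights if counted per position:", counted_per_position)
print("inflation factor:", counted_per_position // stored)
```

Under these assumptions the inflation factor is simply the number of pixel positions, so how large a vision model looks depends heavily on whether repeated weights are counted once or once per position.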

@sichuan_mala there are 22B ViTs but they're not dominant or strictly necessary in the way you'd expect

@gleech Is that a fundamental property of vision models vs text models? Or just a result of which ones have been indexed on?

@SoCratesNOCAP @gleech A thousand words of plain text is 8 kB. Most pictures are over 100 kB.
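The size comparison in the reply above can be checked with back-of-the-envelope arithmetic. This sketch assumes one byte per character, an average of five letters per word plus a separating space, and an uncompressed 224x224 RGB image; none of these numbers come from the thread, and real files vary widely with encoding and compression.

```python
def text_bytes(num_words, avg_word_len=5):
    """Approximate size of plain ASCII text.

    Each word is counted as avg_word_len characters plus one space,
    at one byte per character (an assumption, not a measurement).
    """
    return num_words * (avg_word_len + 1)


def raw_image_bytes(width, height, channels=3, bytes_per_channel=1):
    """Size of an uncompressed image, one byte per channel per pixel by default."""
    return width * height * channels * bytes_per_channel


# Illustrative (assumed) sizes.
thousand_words = text_bytes(1000)
small_image = raw_image_bytes(224, 224)

print("1000 words of text:", thousand_words, "bytes")
print("224x224 RGB image, uncompressed:", small_image, "bytes")
print("image / text size ratio:", small_image / thousand_words)
```

With these assumptions a thousand words comes to a few kilobytes, the same order of magnitude as the figure quoted in the reply, while even a small uncompressed image is well over 100 kilobytes.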

@gleech Why would vision models be smaller than language models if language were more compressible?

@gleech @aliceisplaying that said, textual expressive capacity is so much deeper than vision.

@SoCratesNOCAP @gleech IIUC it's more correct to say one picture *costs* a thousand words while being worth just 1/1000th words. Here's a 4kb picture as proof.

@Substr8Monopoly @gleech It would actually be the inverse of that: language is *already* a very compressed representation, so it is much harder to compress any further. A typical image can be described in a few sentences covering all its major features, while taking orders of magnitude more space even when compressed.

@gleech language is wild like that. text carries so much meaning compared to its size.

@DoozerDiffuser @gleech Bro u are so far behind the news cycle

@saintMarxPlace could easily be a function of differing amounts of inputs into each subfield, yes, but I don't think it is

@sichuan_mala HunyuanVideo is 13B + maybe another 10B for the audio. Sora probably a bit more.
One wrinkle is that the generators need language understanding, so there will be a real chonky LLM involved somewhere