/Tech · 44d ago

Gavin Leech, UK-based AI researcher and co-founder of the consultancy Arb, notes that vision models are roughly 1,000 times smaller than text models and attributes the gap to language's data compression through compositional semantics and abstraction

AI Judge changed the title after evaluation. Original title: "AI researchers discuss efficiency advantages of language models over vision models, noting the latter are roughly 1000 times smaller thanks to language's high-density compression via compositional semantics"

Replies linked the gap to potential image-based chain-of-thought reasoning.

#1315

Original post

rohit @krishnanrohit · #1315 in Tech

@1a3orn CoT in pictures but not words would be quite neat

1a3orn@1a3orn

this is a relevant consideration for projecting how Lindy intelligible CoT is likely to be

9:15 AM · May 17, 2026 · 257 Views

Sentiment

Users are excited about language models' superior data compression over vision models because text conveys far deeper meaning relative to its size.

Pos

100.0%

Neg

0.0%

3 comments with sentiment.

Posts from X

Most Activity

VIEWS 4K

Toby Lightheart@TobyLightheart

@gleech Can you expand on "language's god-tier data compression"? The amount of the human cortex devoted to vision is much larger than language and declarative memory. Suggests LLMs might not be in their final form.

44d4K24

BOOKMARKS 6 · LIKES 59

gavin leech (Non-Reasoning)@gleech

@TobyLightheart my understanding is that

1) functional specialisation is true but overstated. e.g. the cerebellum isn't "motor area", it lights up for everything

2) a lot of the visual demand is due to the demand for real-time vision processing. Relax that and you wouldn't need 30% of cortex

43d3.8K596

RETWEETS 1

gavin leech (Non-Reasoning)@gleech

@ArmanMaesumi some humans do video generation, and their training process is also brutal

https://youtu.be/5osZk9Mw94w?si=lbyezBofYVU3IJM0

43d7492

REPLIES 2

Arman Maesumi@ArmanMaesumi

@gleech For image/video training is far more expensive per parameter because of activation size. Same goes for the size of each piece of data, especially video which requires decode

43d2.2K10

gavin leech (Non-Reasoning)@gleech

@TobyLightheart kōan: what determines the resolution we call "real-time"?

43d2K332

SoCrates (stay, safe)@SoCratesNOCAP

@gleech A picture is worth a thousand words.

43d2K112

🜛∞@DoozerDiffuser

@gleech Note: those image models don't understand, and cannot replicate, any intelligible text.

43d1.5K7

gavin leech (Non-Reasoning)@gleech

@ArmanMaesumi Yeah brutal memory cost, and generation is a different beast than the classification stuff I was thinking of.

On flops, I think it's vision classification << LLM generation << video diffusion

43d1.5K5

BorisTheBrave@boris_brave

@TobyLightheart @gleech AI vision models use a small number of parameters and repeat those parameters over every pixel. Human vision also has similar repetitions, but we count every neuron separately. If we scored vision models like the visual cortex, they would be vastly larger.

43d3541
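
The weight-sharing point in @boris_brave's reply can be put into rough numbers. The sketch below is illustrative only: the layer shape (a 3x3 convolution with 64 input and 64 output channels over a 56x56 output grid) is an assumption chosen for the example, not a figure from the thread, and the function names are hypothetical. It compares the usual shared parameter count with the count you would get if every output position were given its own copy of the filter, which is loosely how counting each neuron separately would score it.

```python
def conv2d_params(k, c_in, c_out):
    """Weights plus biases of one k x k convolution layer (shared across positions)."""
    return k * k * c_in * c_out + c_out


def unshared_params(k, c_in, c_out, h_out, w_out):
    """Same layer if every output position kept its own private copy of the filter."""
    return conv2d_params(k, c_in, c_out) * h_out * w_out


# Assumed, illustrative layer shape (not taken from the thread).
shared = conv2d_params(3, 64, 64)
unshared = unshared_params(3, 64, 64, 56, 56)

print(f"shared parameters:   {shared:,}")
print(f"unshared parameters: {unshared:,}")
print(f"inflation factor:    {unshared // shared:,}x")
```

Under these assumptions the inflation factor is simply the number of output positions, so the size of the gap depends on the image resolution as much as on the architecture.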

gavin leech (Non-Reasoning)@gleech

@sichuan_mala there are 22B ViTs but they're not dominant or strictly necessary in the way you'd expect

43d1335

Rusty Shackleford@saintMarxPlace

@gleech Is that a fundamental property of vision models vs text models? Or just a result of which ones have been indexed on?

43d2.6K2

Chris O.@crowd_of_one

@SoCratesNOCAP @gleech A thousand words of plain text is 8kb. Most pictures are over a 100kb.

43d524
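
The size comparison in @crowd_of_one's reply is easy to sanity-check. This is a back-of-envelope sketch, not a measurement: the average of 6 bytes per word (about five ASCII letters plus a space) is an assumption, "kb" is read as kilobytes, and the 8 KB and 100 KB figures are taken from the post as stated.

```python
WORDS = 1000
AVG_BYTES_PER_WORD = 6  # assumption: ~5 ASCII letters plus one space

# Estimated size of a thousand words of plain text.
text_bytes = WORDS * AVG_BYTES_PER_WORD

# Figures quoted in the post, read as kilobytes.
post_text_bytes = 8 * 1000
image_bytes = 100 * 1000

ratio_estimated = image_bytes / text_bytes
ratio_from_post = image_bytes / post_text_bytes

print(f"estimated text size: {text_bytes} bytes")
print(f"image / text (estimated): {ratio_estimated:.1f}x")
print(f"image / text (post's figures): {ratio_from_post:.1f}x")
```

Either way the picture comes out roughly an order of magnitude larger than the thousand words, which is the direction of the claim; the exact multiple depends on the word-length and image-size assumptions.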

Substrate Monopoly@Substr8Monopoly

@gleech Why would vision models be smaller than language models if language were more compressible?

43d3661

Oscar Moxon@oscarmoxon

@gleech @aliceisplaying that said, textual expressive capacity is so much deeper than vision.

43d464

Roland Bouman@rolandbouman

@SoCratesNOCAP @gleech IIUC it's more correct to say one picture *costs* a thousand words while being worth just 1/1000th words. Here's a 4kb picture as proof.

43d372

Pavel Batuev@shalcker

@Substr8Monopoly @gleech It would be actually inverse of that - language is *already* very compressed representation, so it is much harder to compress any further. Typical image can be described by a few sentences on all major features while taking orders of magnitude more space even when compressed.

43d272

Utkarsh Singh@Utkarsh51557661

@gleech language is wild like that. text carries so much meaning compared to its size.

43d1.1K2

MoonGotArt@MoonGotArt

@DoozerDiffuser @gleech Bro u are so far behind the news cycle

43d92

gavin leech (Non-Reasoning)@gleech

@saintMarxPlace could easily be a function of differing amounts of inputs into each subfield, yes, but I don't think it is

43d1.6K1

gavin leech (Non-Reasoning)@gleech

@sichuan_mala HunyuanVideo is 13B + maybe another 10B for the audio. Sora probably a bit more.

One wrinkle is that the generators need language understanding, so there will be a real chonky LLM involved somewhere

43d673