
Analyst Dylan Patel says OpenAI’s GPT-4o had roughly 600 billion parameters and that OpenAI’s models are now sparser than Anthropic’s

Story Overview

During a podcast, analyst Dylan Patel remarked in passing that GPT-4o totaled roughly 600 billion parameters and that OpenAI's models now run on a noticeably sparser Mixture-of-Experts setup than Anthropic's designs, framing the choice as a move toward cheaper inference and a better fit for NVIDIA hardware. The comments draw on comparisons with older models such as DeepSeek-V3 but include no new product details and no official confirmation from OpenAI.


Original post

will brown@willccbb

love @dylan522p podcasts bc he just leaks lab secrets by accident as supporting evidence for arguments he’s making. like apparently gpt-4o was around 600b parameters, and openai models are now much sparser than anthropic models

8:50 AM · Jul 2, 2026 · 29.5K Views

Open Question

Podcast remarks leave source unclear

Listeners are still parsing whether Patel's offhand figures came from supply-chain sleuthing, an inadvertent slip by someone in the know, or simple speculation, with no confirmation or denial issued by any lab involved.

Cost Pressure

Sparser designs could trim running costs

Patel said OpenAI has recently leaned harder into sparsity than Anthropic, which would lower the number of parameters active per token during inference. The precise ratios and performance trade-offs remain unverified.
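To make the cost argument concrete, here is a rough sketch of how total and active parameter counts relate in a sparse Mixture-of-Experts model. The 600 billion total is the unconfirmed figure attributed to Patel. The active fractions (1/16 and 1/64) are purely hypothetical, borrowed from speculation in the replies, and the function name `active_parameters` is invented for this illustration. The model is deliberately simplified: it treats every parameter as expert weight and ignores shared layers such as attention and embeddings.

```python
def active_parameters(total_params: float, active_fraction: float) -> float:
    """Estimate parameters used per token in a sparse MoE model.

    Simplifying assumption: all parameters sit in experts, so the
    active count is just total * fraction. Real models also have
    always-active shared layers, which this sketch ignores.
    """
    if not 0 < active_fraction <= 1:
        raise ValueError("active_fraction must be in (0, 1]")
    return total_params * active_fraction


# Unconfirmed total attributed to Patel's podcast remarks.
TOTAL_PARAMS = 600e9

# Hypothetical active fractions; 1.0 stands in for a fully dense model.
for fraction in (1.0, 1 / 16, 1 / 64):
    active = active_parameters(TOTAL_PARAMS, fraction)
    print(f"active fraction {fraction:.4f}: {active / 1e9:.3f}B parameters per token")
```

Per-token compute scales roughly with the active count rather than the total, which is why a sparser design can be cheaper to serve even when the total parameter count is large.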

Sentiment

Commenters with positive sentiment treated the podcast leaks on GPT-4o parameters and sparsity as entertaining, while those with negative sentiment criticized the disclosures as tattling or as potentially devastating for research programs and OpenAI pretraining.

Pos: 33.4%

Neg: 66.6%

6 comments with sentiment.


Posts from X


Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

…would be devastating if it turned out that sparsity just doesn't scale and Ant brute forced to True Science of Pretraining on chonkers. (almost inconceivable on priors, but there are enough contingent choices in conventional MoE design that it could be true. Router, damn you…)


kache@yacineMTB

@willccbb @dylan522p "accident"


will brown@willccbb

@teortaxesTex the argument he made was about hardware efficiency, anthropic pretrains on TPUs + serves on trainium, optimal sparsity + shape is different vs what you’d do on a full GPU stack like openai


Hamel Husain@HamelHusain

@willccbb @dylan522p Is he the one leaking or are people leaking to him


Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@willccbb …If trainium happens to incentivize a superior (denser) architecture, that'd be devastating in its own way


Jeffrey Emanuel@doodlestein

@willccbb @dylan522p I think Microsoft already accidentally leaked that particular secret months ago.


Ziliang Peng 🏴‍☠️@cyberpengk

@willccbb @dylan522p One of the few people whose podcast I don't outsource to LLM to watch


will brown@willccbb

@teortaxesTex NVL72 is just really optimized for expert parallel i think


oso@osoleve

@willccbb @dylan522p Can you get him to spill on GPT-OSS-2? Maybe just say something wrong about it so he corrects you 🤓


unintentionally pursuing jesus@braadleeyy_

@willccbb @dylan522p well don’t alert him to this


arb8020@arb8020

@willccbb @dylan522p tattletale


mark erdmann@markerdmann

@willccbb @dylan522p "openai models are now much sparser than anthropic models" - isn't that obvious already from the price per token and the model feel?


Ankith 🐋/acc@dhtikna

@teortaxesTex I meant atleast mythos would be like 1/16 active params right, maybe not 1/64


Noob@pretrainguy

@willccbb @dylan522p No need to guess, its quite easy to derive. For openai models, it not super hard to deduce basic things from numbers of cerebras. For anthropic, same thing, everybody knows claude power consumption because google knows it.


DavidHummel@DavidHumme5859

@willccbb @dylan522p Is that why so much more token efficient?


TheCoderBTW@TheCoderBtw

@willccbb @dylan522p the accidental leaks always seem better sourced than the official ones lol


Buddy vanderBuddy@ptntlbyrnths

@teortaxesTex would be bad for DeepSeek research program


安叫兽|Bird🕊️ 🔶 BNB@ajs6888

@willccbb @dylan522p This density of leaks makes for a little too good a mealtime watch


Charuru Charuru@CharuruCha14310

@teortaxesTex Wut, sparsity was always a trade off, not sure what this post is supposed to mean. We knew for a long time that OAI failed in pretraining which made them take a step back for GPT-5.


Christopher Malili@Christo31306687

@teortaxesTex Sparsity not scaling would be counter intuitive. You would expect sparsity to be the thing that scales. What am I missing
