love @dylan522p podcasts bc he just leaks lab secrets by accident as supporting evidence for arguments he’s making. like apparently gpt-4o was around 600b parameters, and openai models are now much sparser than anthropic models
Analyst Dylan Patel says OpenAI’s GPT-4o had roughly 600 billion parameters and that OpenAI’s models are now much sparser than Anthropic’s
Story Overview
Analyst Dylan Patel dropped casual remarks during a podcast about GPT-4o totaling roughly 600 billion parameters and running on a noticeably sparser Mixture-of-Experts setup than Anthropic's designs, framing the choice as a move toward cheaper inference and better NVIDIA hardware alignment. The comments reference older model comparisons like DeepSeek-V3 but carry no fresh product details or official backing from OpenAI.
Podcast remarks leave source unclear
Listeners are still parsing whether Patel's offhand figures came from supply-chain sleuthing, an inadvertent slip by someone in the know, or simple speculation, with no confirmation or denial issued by any lab involved.
Sparser designs could trim running costs
Patel noted OpenAI has leaned harder into sparsity lately than its peers, which might lower active-parameter counts during inference, yet the precise ratios and performance trade-offs stay unverified.
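To make the active-parameter point concrete, here is a rough arithmetic sketch. It assumes the unverified 600-billion total quoted above and uses purely hypothetical active fractions; none of these ratios were reported. It also uses the common rule of thumb of about 2 forward-pass FLOPs per active parameter per token, which ignores attention and routing overhead. The function names are illustrative.

```python
# Illustrative arithmetic only: the total is the unverified figure quoted above,
# and the active fractions are hypothetical, not reported values.

def active_params(total_params: float, active_fraction: float) -> float:
    """Parameters actually used per token in a sparse MoE model."""
    if not 0.0 < active_fraction <= 1.0:
        raise ValueError("active_fraction must be in (0, 1]")
    return total_params * active_fraction


def forward_flops_per_token(active: float) -> float:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter."""
    return 2.0 * active


TOTAL = 600e9  # unverified figure attributed to Patel

for fraction in (1.0, 1 / 4, 1 / 16):  # hypothetical ratios, dense to sparse
    active = active_params(TOTAL, fraction)
    gflops = forward_flops_per_token(active) / 1e9
    print(f"fraction {fraction:.4f}: {active / 1e9:.1f}B active, ~{gflops:.0f} GFLOPs/token")
```

Under these assumptions, per-token compute falls in proportion to the active fraction while the full parameter set still has to be held in memory, which is why sparsity and hardware choices end up linked.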
Some users praised the podcast leaks on GPT-4o's parameter count and sparsity as entertaining, while others criticized the disclosures as tattling or speculated that the implications could be devastating for research programs and for OpenAI's pretraining.
Most Activity
…would be devastating if it turned out that sparsity just doesn't scale and Ant brute forced to True Science of Pretraining on chonkers. (almost inconceivable on priors, but there are enough contingent choices in conventional MoE design that it could be true. Router, damn you…)
@willccbb @dylan522p "accident"
@teortaxesTex the argument he made was about hardware efficiency, anthropic pretrains on TPUs + serves on trainium, optimal sparsity + shape is different vs what you’d do on a full GPU stack like openai
@willccbb @dylan522p Is he the one leaking or are people leaking to him

@willccbb …If trainium happens to incentivize a superior (denser) architecture, that'd be devastating in its own way

@willccbb @dylan522p I think Microsoft already accidentally leaked that particular secret months ago.

@willccbb @dylan522p One of the few people whose podcast I don't outsource to LLM to watch

@teortaxesTex NVL72 is just really optimized for expert parallel i think

@willccbb @dylan522p Can you get him to spill on GPT-OSS-2? Maybe just say something wrong about it so he corrects you 🤓

@willccbb @dylan522p well don’t alert him to this

@willccbb @dylan522p tattletale

@willccbb @dylan522p "openai models are now much sparser than anthropic models" - isn't that obvious already from the price per token and the model feel?

@teortaxesTex I meant at least mythos would be like 1/16 active params right, maybe not 1/64

@willccbb @dylan522p No need to guess, it's quite easy to derive. For openai models, it's not super hard to deduce basic things from cerebras numbers. For anthropic, same thing, everybody knows claude power consumption because google knows it.

@willccbb @dylan522p Is that why so much more token efficient?

@willccbb @dylan522p the accidental leaks always seem better sourced than the official ones lol

@teortaxesTex would be bad for DeepSeek research program

@willccbb @dylan522p The density of leaks here makes it a little too good as mealtime viewing

@teortaxesTex Wut, sparsity was always a trade off, not sure what this post is supposed to mean. We knew for a long time that OAI failed in pretraining which made them take a step back for GPT-5.

@teortaxesTex Sparsity not scaling would be counterintuitive. You would expect sparsity to be the thing that scales. What am I missing?