12.30pm today on the @Microsoft Build stream
@NoPriorsPod x @latentspacepod x @satyanadella
Join us! :)
Really excited to interview @satyanadella tomorrow at @Microsoft Build, in a special live @NoPriorsPod. Questions?
AI Judge changed title after evaluation, original title: "Shawn Wang, Latent Space host, to interview Microsoft CEO Satya Nadella at Microsoft Build today"
Analysts estimate the Mythos model utilized 6.1x10^27 FLOPs.
Critics dismissed the Mythos FLOP leak and the charts shown at Microsoft Build as epic failures or rage bait, while supporters praised the in-house training achievements and the speaker panels.
uhhh
did Mustafa just leak the Mythos FLOP count??
was this public knowledge before, even if its an estimate i dont get what you gain out of this
Microsoft leaked the training FLOPS for Claude Mythos
based on their slide Claude Mythos used: 6.1*10^27 FLOPs
(with 95% CI at 5.3*10^27 and 7.1*10^27, assuming 1 px measurement error)
the guys she tells you not to worry about
Ok what? Same Training FLOPs as Gemini 3.1 pro?
you have to give @MicrosoftAI props for training all these in-house from scratch and getting ALL of them to near-SOTA.
Mustafa built a full fledged neolab inside Microsoft in 2 years, that now MS fully controls from chip to model to harness. Absurdly impressive.
My Claude Mythos best guess for params, compute and data:
10.60T total @ 530B active params, 6.74e26 FLOP, 212T training tokens
The 6.1e27 FLOP figure for Claude Mythos is not realistic. The Microsoft intern just threw some darts.
Instead, let me throw some darts. This is my realistic estimate for Claude Mythos compute, total params, active params and training tokens!
Mythos was likely a training run between 3.37e26 and 1.46e27 FLOP, with 7.5T–15.6T total parameters and 375B–780B active parameters and 150T–312T training tokens.
My median case is 6.74e26 FLOP, 212T training tokens, 10.60T@530B.
---
Here's how I got these numbers:
AWS CEO Matt Garman said in an interview with CNBC Television on October 17th 2025 that "the next generation of models [are] all being built on Trainium 2 as part of this giant cluster".
In the same interview he also confirmed that "they are already running [...] about 500,000 [Trainium2] chips in Indiana today".
And on Feb 5th 2026, from Amazon's Q4 earnings: "Trainium2 powers Project Rainier, the world’s largest operational AI compute cluster with 500,000+ Trainium2 chips, which Anthropic is using to train its industry-leading AI model, Claude."
According to Anthropic, an early version of Mythos has been available for internal use since February 24th 2026.
The largest known MoE training run, Llama-4 Behemoth (2T MoE), had a 19.7% MFU on >100k H100s, and Microsoft just reported slightly over 20% MFU on 8192 GB200s for their MAI Thinking 1 model (1T MoE).
The assumptions:
- training took more than 1 month, but less than the time span from Oct 17th to Feb 24th
- 1.3 PFLOP/s of FP8 compute per Trainium 2 chip
- a conservative 20% MFU
The low end, with only 500k chips for 1 month, implies:
- 3.37e26 FLOP
The high end, with 500k chips from Oct 17th until Feb 24th (130 days), implies:
- 1.46e27 FLOP
My middle scenario is 500k chips for 60-90 days, implying 6.74e26 to 1.01e27 FLOP (I don't think any sane lab would train for more than 3 months nowadays given the pace of progress, and even 90 days feels like a lot. I think it's closer to 60, but MFU is likely closer to 30% than 20%, so I will take both of these estimates as my base cases.)
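The compute figures above are chips × peak throughput × MFU × wall-clock time. A minimal sketch of that arithmetic, using the thread's stated assumptions (500k chips, 1.3 PFLOP/s FP8 per chip, 20% MFU). The function name, the constants' names, and the 30-day reading of "1 month" are illustrative choices, not from the thread:

```python
from datetime import date

PEAK_FLOPS_PER_CHIP = 1.3e15   # assumed FP8 peak per Trainium2 chip, in FLOP/s
NUM_CHIPS = 500_000            # chip count quoted in the thread
MFU = 0.20                     # the thread's "conservative" utilization
SECONDS_PER_DAY = 86_400


def training_flop(days, chips=NUM_CHIPS, peak=PEAK_FLOPS_PER_CHIP, mfu=MFU):
    """Total training FLOP for a run lasting `days` at the given utilization."""
    return chips * peak * mfu * days * SECONDS_PER_DAY


# Longest possible run: from the Oct 17th interview to Feb 24th internal availability.
max_days = (date(2026, 2, 24) - date(2025, 10, 17)).days

for label, days in [("1 month", 30), ("60 days", 60), ("90 days", 90), ("max", max_days)]:
    print(f"{label:>8}: {training_flop(days):.2e} FLOP")
```

Note that 60 days at 30% MFU gives the same total as 90 days at 20% MFU (`training_flop(60, mfu=0.30)`), which is why the two base cases bracket the "shorter run, better utilization" scenario.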
---
My initial guess is that total training tokens were between 100T and 300T.
Personally, I also don't think Anthropic is sparsity-maxxing the same way Google is. They clearly don't care that much about inference efficiency, but more about model/reasoning performance.
Recent papers also put the optimal total/active ratio at 18-25 (https://arxiv.org/pdf/2603.21862v1), while other papers mention that reasoning ability peaks around 20 tokens per parameter (https://arxiv.org/pdf/2508.18672). I think 5% sparsity (a total/active ratio of 20) is reasonable.
So we have, with C the training compute in FLOP and D the number of training tokens:
Equation 1: N_active = C / (6 * D)
Our 5% sparsity and 20 TPP give us: N_total = D / 20 and N_active = 0.05 * N_total
Equation 2: N_active = D / 400
Now set Eq. 1 and Eq. 2 equal: C / (6 * D) = D / 400 -> D = sqrt(400 * C / 6)
Now we can just plug in whatever we think the compute budget was.
With my lower and upper bound, and my two base case guesses we get this table, which puts Claude Mythos at:
- 7.5T–15.6T total parameters
- 375B–780B active parameters
- 150T–312T training tokens
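The ranges above can be reproduced by plugging each compute budget into D = sqrt(400 * C / 6). A minimal sketch, assuming the thread's 20 tokens per total parameter and 5% active fraction; the function and constant names are hypothetical:

```python
import math

TOKENS_PER_TOTAL_PARAM = 20   # assumed D / N_total (the thread's 20 TPP)
ACTIVE_FRACTION = 0.05        # assumed N_active / N_total (the thread's 5% sparsity)


def size_from_compute(c_flop, tpp=TOKENS_PER_TOTAL_PARAM, active=ACTIVE_FRACTION):
    """Solve C = 6 * N_active * D under N_active = active * D / tpp."""
    tokens = math.sqrt(c_flop * tpp / (6 * active))
    n_total = tokens / tpp
    n_active = active * n_total
    return tokens, n_total, n_active


# Lower bound, the two base cases, and the upper bound from the thread.
for c in (3.37e26, 6.74e26, 1.01e27, 1.46e27):
    d, n_total, n_active = size_from_compute(c)
    print(f"C={c:.2e}: D={d / 1e12:.0f}T tokens, "
          f"N_total={n_total / 1e12:.2f}T, N_active={n_active / 1e9:.0f}B")
```

Because D scales with the square root of C, a 4x spread in compute only moves the token and parameter estimates by about 2x, so the result is more sensitive to the TPP and sparsity assumptions than to the exact FLOP budget.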
I think Anthropic likely trained the Mythos base model from roughly October to December using on the order of 6.7e26–1.0e27 FLOP.
Since then, the RL-to-base-model-training FLOP ratio is plausibly somewhere between 0.5 and 3, depending on how much of the expanded Trainium 2 fleet was actually allocated to Mythos RL.
The reason this range is plausible is that public AWS/Anthropic statements imply Anthropic-accessible Trainium2 capacity grew from roughly 500k chips around Rainier’s launch to over 1M Trainium2 chips for Claude training and serving.
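To see what that ratio would mean for total compute, here is a rough sketch. It assumes total compute is simply pretraining plus RL (the thread does not state this explicitly), and the names are illustrative:

```python
BASE_FLOP = (6.74e26, 1.01e27)   # base-case pretraining range from the thread
RL_RATIO = (0.5, 3.0)            # assumed RL FLOP / pretraining FLOP range


def total_flop(base, rl_ratio):
    """Pretraining FLOP plus RL FLOP, where RL = rl_ratio * base."""
    return base * (1.0 + rl_ratio)


low = total_flop(BASE_FLOP[0], RL_RATIO[0])
high = total_flop(BASE_FLOP[1], RL_RATIO[1])
print(f"pretraining + RL: {low:.2e} to {high:.2e} FLOP")
```

Under these assumptions, even the high end of the range stays below the 6.1e27 FLOP read off the slide.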
I think everyone assumes X * 10^26 on pretraining and 10x of that on RL
Remember when people were seriously discussing 10^24 flops as a serious risk
It's interesting how many orders of magnitude people can be wrong by without updating their fundamental beliefs
maybe these are just estimates
because I'm not sure all of the other blue dots make sense
"claude generate randomish dots around the trendline" We need better PPT culture
big if true
Correction, the numbers are bogus:
@scaling01 I think this looks pretty vibe-drawn to me.

@saranormous codex is agi man
https://youtu.be/cFNI2FORAc0
oneshotted this, no notes

@scaling01 Looks roughly the same as Gemini 3.1 pro, which was estimated to be 50T param
@xlr8harder and I have changed my fundamental beliefs
me@2024: Powerful AIs might all be misaligned; let’s help humanity coordinate on formal verification and strict boxing
me@2026: Too late! Powerful AIs are ~here, and some are open-weights. But some are aligned! Let’s help *them* cooperate on formal verification and cybersecurity

@swyx Ah yes, I really enjoyed using that model which was trained with more compute than Mythos in 2024.
@swyx Probably just an estimate just like how people thought a Microsoft paper leaked gpt-4o param count but was just an estimate
@xlr8harder I think the specific 10^24 ops number originates with me
I fully agree. Roughly, this threshold should be when any single number has more than 10²⁴ ALU operations, or 10²⁷ logic gates, in its entire causal history.
GPT-3, AlphaFold 2, Stable Diffusion, LLaMa, Dromedary: below the line.
GPT-4, PaLM 2, Claude-Next: over the line.