Microsoft Build slide plotting unreleased "Opus 4.6" and "Mythos" models sparks debate over leaked 2026 compute scales · Digg

Posts from X

Most Activity

VIEWS 425.4K · BOOKMARKS 429 · REPLIES 55

swyx 🔜 @aiDotEngineer@swyx

uhhh

did Mustafa just leak the Mythos FLOP count??

was this public knowledge before, even if its an estimate i dont get what you gain out of this

swyx 🔜 @aiDotEngineer@swyx

12.30pm today on the @Microsoft Build stream

@NoPriorsPod x @latentspacepod x @satyanadella

Join us! :)

21d425.4K915429

LIKES 993 · RETWEETS 39

Lisan al Gaib@scaling01

Microsoft leaked the training FLOPS for Claude Mythos

based on their slide Claude Mythos used: 6.1*10^27 FLOPs

(with 95% CI at 5.3*10^27 and 7.1*10^27, assuming 1 px measurement error)

21d284.8K993368

Lisan al Gaib@scaling01

the guys she tells you not to worry about

Lisan al Gaib@scaling01

Microsoft leaked the training FLOPS for Claude Mythos

based on their slide Claude Mythos used: 6.1*10^27 FLOPs

(with 95% CI at 5.3*10^27 and 7.1*10^27, assuming 1 px measurement error)

21d144.5K937132

Chubby♨️@kimmonismus

Ok what? Same Training FLOPs as Gemini 3.1 pro?

swyx 🔜 @aiDotEngineer@swyx

uhhh

did Mustafa just leak the Mythos FLOP count??

was this public knowledge before, even if its an estimate i dont get what you gain out of this

20d86.2K51191

swyx 🔜 @aiDotEngineer@swyx

you have to give @MicrosoftAI props for training all these in-house from scratch and getting ALL of them to near-SOTA.

Mustafa built a full fledged neolab inside Microsoft in 2 years, that now MS fully controls from chip to model to harness. Absurdly impressive.

swyx 🔜 @aiDotEngineer@swyx

uhhh

did Mustafa just leak the Mythos FLOP count??

was this public knowledge before, even if its an estimate i dont get what you gain out of this

21d25.3K20944

Lisan al Gaib@scaling01

My Claude Mythos best guess for params, compute and data:

- 10.60T total @ 530B active params
- 6.74e26 FLOP
- 212T training tokens

Lisan al Gaib@scaling01

The 6.1e27 FLOP figure for Claude Mythos is not realistic. The Microsoft intern just threw some darts.

Instead, let me throw some darts. This is my realistic estimate for Claude Mythos compute, total params, active params and training tokens!

Mythos was likely a training run between 3.37e26 and 1.46e27 FLOP, with 7.5T–15.6T total parameters and 375B–780B active parameters and 150T–312T training tokens.

My median case is 6.74e26 FLOP, 212T training tokens, 10.60T@530B.

---

Here's how I got these numbers:

AWS CEO Matt Garman said in an interview with CNBC Television on October 17th 2025 that "the next generation of models [are] all being built on Trainium 2 as part of this giant cluster".

In the same interview he also confirmed that "they are already running [...] about 500,000 [Trainium2] chips in Indiana today".

And on Feb 5th 2025 from Amazon's Q4 earnings: "Trainium2 powers Project Rainier, the world’s largest operational AI compute cluster with 500,000+ Trainium2 chips, which Anthropic is using to train its industry-leading AI model, Claude."

According to Anthropic, an early version of Mythos has been available for internal use since February 24th 2026.

The largest known MoE training run, Llama-4 Behemoth (2T MoE), had a 19.7% MFU on >100k H100s, and Microsoft just reported slightly over 20% MFU on 8192 GB200s for their MAI Thinking 1 model (1T MoE).

The assumptions:
- training took more than 1 month, but less than the span from Oct 17th to Feb 24th (130 days)
- 1.3 PFLOP/s of peak FP8 compute per Trainium2 chip
- a conservative 20% MFU

The low end, with only 500k chips for 1 month, implies:
- 3.37e26 FLOP

The high end, with 500k chips from Oct 17th until Feb 24th, implies:
- 1.46e27 FLOP

My middle scenario is 500k chips for 60-90 days, implying 6.74e26 to 1.01e27 FLOP (I don't think any sane lab would train for more than 3 months nowadays given the pace of progress, and even 90 days feels like a lot. I think it's closer to 60, but MFU is likely closer to 30% than 20%, so I will take both of these estimates as my base cases.)
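The compute range above is simple arithmetic, and a short Python sketch can reproduce it. It assumes exactly what the post assumes (500k chips, 1.3 PFLOP/s FP8 peak per chip, 20% MFU) and treats "1 month" as 30 days; the function name `training_flop` is made up for illustration.

```python
from datetime import date

# Assumptions taken from the post, not confirmed figures.
PEAK_FLOPS_PER_CHIP = 1.3e15  # FP8 peak per Trainium2 chip, in FLOP/s
CHIPS = 500_000               # Trainium2 chips in the cluster
MFU = 0.20                    # model FLOP utilization
SECONDS_PER_DAY = 86_400


def training_flop(days, chips=CHIPS, peak=PEAK_FLOPS_PER_CHIP, mfu=MFU):
    """Total training FLOP for a run of `days` days at the given utilization."""
    return chips * peak * mfu * days * SECONDS_PER_DAY


# Upper bound on duration: Oct 17th 2025 to Feb 24th 2026.
max_days = (date(2026, 2, 24) - date(2025, 10, 17)).days

scenarios = [
    ("low (30 days)", 30),
    ("base (60 days)", 60),
    ("base (90 days)", 90),
    ("high (full span)", max_days),
]
for label, days in scenarios:
    print(f"{label:>18}: {training_flop(days):.2e} FLOP")
```

Under these assumptions, 60 days at 30% MFU gives the same total as 90 days at 20% MFU (`training_flop(60, mfu=0.30)`), which is why both base cases can stand in for the "60 days, higher MFU" scenario.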

---

My initial guess is that total training tokens were between 100T and 300T.

Personally, I also don't think Anthropic is sparsity-maxxing the same way Google is. They clearly don't care that much about inference efficiency, but more about model/reasoning performance.

Recent papers also put the optimal total/active ratio at 18-25 (https://arxiv.org/pdf/2603.21862v1), while other papers mention that reasoning ability peaks around 20 tokens per parameter (https://arxiv.org/pdf/2508.18672). I think 5% sparsity (active params = 1/20 of total params) is reasonable.

So we have:
Equation 1: N_active = C / (6D)

Our 5% sparsity and 20 TPP (tokens per total parameter) give us N_total = D / 20 and N_active = 0.05 * N_total, so:
Equation 2: N_active = D / 400

Now set Eq. 1 and Eq. 2 equal:
C / (6D) = D / 400 -> D = sqrt(400C / 6)

Now we can just plug in whatever we think the compute budget was.

With my lower and upper bound, and my two base case guesses we get this table, which puts Claude Mythos at:

- 7.5T–15.6T total parameters
- 375B–780B active parameters
- 150T–312T training tokens
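To make the plug-in step concrete, here is a minimal Python sketch of the derivation. It only encodes the post's own assumptions (C = 6 * N_active * D, 20 tokens per total parameter, 5% of parameters active); the function name `mythos_shape` is hypothetical, and the compute budgets are the post's guesses, not measured values.

```python
import math

# Assumptions from the post: 5% of parameters active, 20 tokens per total parameter.
ACTIVE_FRACTION = 0.05
TOKENS_PER_TOTAL_PARAM = 20


def mythos_shape(compute_flop):
    """Solve C = 6 * N_active * D together with N_active = D / 400 for a given C."""
    k = TOKENS_PER_TOTAL_PARAM / ACTIVE_FRACTION   # D = k * N_active, here k = 400
    tokens = math.sqrt(k * compute_flop / 6)       # D = sqrt(400 * C / 6)
    n_total = tokens / TOKENS_PER_TOTAL_PARAM      # N_total = D / 20
    n_active = ACTIVE_FRACTION * n_total           # N_active = 0.05 * N_total
    return tokens, n_total, n_active


# Lower bound, two base cases, and upper bound from the compute estimate.
for c in (3.37e26, 6.74e26, 1.01e27, 1.46e27):
    tokens, n_total, n_active = mythos_shape(c)
    print(
        f"C={c:.2e} FLOP -> "
        f"D={tokens / 1e12:.0f}T tokens, "
        f"N_total={n_total / 1e12:.2f}T, "
        f"N_active={n_active / 1e9:.0f}B"
    )
```

Because D scales with the square root of C, the roughly 4.3x spread in compute between the bounds turns into only about a 2.1x spread in tokens and parameters.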

20d24.4K20644

Lisan al Gaib@scaling01

I think Anthropic likely trained the Mythos base model from roughly October to December using on the order of 6.7e26–1.0e27 FLOP.

Since then, the RL-to-base-model-training FLOP ratio is plausibly somewhere between 0.5 and 3, depending on how much of the expanded Trainium 2 fleet was actually allocated to Mythos RL.

The reason this range is plausible is that public AWS/Anthropic statements imply Anthropic-accessible Trainium2 capacity grew from roughly 500k chips around Rainier’s launch to over 1M Trainium2 chips for Claude training and serving.
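As a rough sketch of what that ratio would imply for total compute, the Python snippet below adds RL compute on top of the base run. It assumes total = base * (1 + RL ratio), which is an illustrative simplification rather than anything stated in the thread; `total_flop` is a made-up helper name.

```python
# Ranges quoted in the post above; both are the author's estimates.
BASE_FLOP_RANGE = (6.7e26, 1.0e27)   # base-model pretraining compute
RL_RATIO_RANGE = (0.5, 3.0)          # RL compute as a multiple of base compute

# Figure read off the Microsoft slide earlier in the thread.
SLIDE_FLOP = 6.1e27


def total_flop(base_flop, rl_ratio):
    """Base compute plus RL compute, with RL expressed as a multiple of base."""
    return base_flop * (1 + rl_ratio)


low = total_flop(BASE_FLOP_RANGE[0], RL_RATIO_RANGE[0])
high = total_flop(BASE_FLOP_RANGE[1], RL_RATIO_RANGE[1])

print(f"total compute range: {low:.2e} to {high:.2e} FLOP")
print(f"slide figure / high end: {SLIDE_FLOP / high:.2f}x")
```

Under these assumptions, even the high end of base plus RL stays below the 6.1e27 FLOP read off the slide.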

Lisan al Gaib@scaling01

The 6.1e27 FLOP figure for Claude Mythos is not realistic. The Microsoft intern just threw some darts.

20d22.9K18147

Zephyr@zephyr_z9

I think everyone assumes this X *10e26 on pretraining and 10x of that on RL

swyx 🔜 @aiDotEngineer@swyx

uhhh

did Mustafa just leak the Mythos FLOP count??

was this public knowledge before, even if its an estimate i dont get what you gain out of this

21d28.2K9635

xlr8harder@xlr8harder

Remember when people were seriously discussing 10^24 flops as a serious risk

It's interesting how many orders of magnitude people can be wrong by without updating their fundamental beliefs

swyx 🔜 @aiDotEngineer@swyx

uhhh

did Mustafa just leak the Mythos FLOP count??

was this public knowledge before, even if its an estimate i dont get what you gain out of this

21d15.1K16813

Lisan al Gaib@scaling01

maybe these are just estimates

because I'm not sure all of the other blue dots make sense

Lisan al Gaib@scaling01

Microsoft leaked the training FLOPS for Claude Mythos

based on their slide Claude Mythos used: 6.1*10^27 FLOPs

(with 95% CI at 5.3*10^27 and 7.1*10^27, assuming 1 px measurement error)

21d20.4K1013

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

"claude generate randomish dots around the trendline" We need better PPT culture

Lisan al Gaib@scaling01

Microsoft leaked the training FLOPS for Claude Mythos

based on their slide Claude Mythos used: 6.1*10^27 FLOPs

(with 95% CI at 5.3*10^27 and 7.1*10^27, assuming 1 px measurement error)

20d4.8K668

ben hylak@benhylak

big if true

swyx 🔜 @aiDotEngineer@swyx

uhhh

did Mustafa just leak the Mythos FLOP count??

was this public knowledge before, even if its an estimate i dont get what you gain out of this

21d6.7K199

Lisan al Gaib@scaling01

Correction, the numbers are bogus:

Lisan al Gaib@scaling01

The 6.1e27 FLOP figure for Claude Mythos is not realistic. The Microsoft intern just threw some darts.

20d11.5K426

Lucas Beyer (bl16)@giffmana

@scaling01 I think this looks pretty vibe-drawn to me.

Lisan al Gaib@scaling01

Microsoft leaked the training FLOPS for Claude Mythos

based on their slide Claude Mythos used: 6.1*10^27 FLOPs

(with 95% CI at 5.3*10^27 and 7.1*10^27, assuming 1 px measurement error)

20d9K681

swyx@swyx

@saranormous codex is agi man

https://youtu.be/cFNI2FORAc0

oneshotted this, no notes

20d2.4K123

Scott@scottstts

@scaling01 Looks roughly the same as Gemini 3.1 pro, which was estimated to be 50T param

21d3.7K103

davidad 🎇@davidad

@xlr8harder and I have changed my fundamental beliefs

davidad 🎇@davidad

me@2024: Powerful AIs might all be misaligned; let’s help humanity coordinate on formal verification and strict boxing

me@2026: Too late! Powerful AIs are ~here, and some are open-weights. But some are aligned! Let’s help *them* cooperate on formal verification and cybersecurity

20d984243

George Wing@george__wing

@swyx Ah yes, I really enjoyed using that model which was trained with more compute than Mythos in 2024.

21d4.6K183

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

@swyx Probably just an estimate just like how people thought a Microsoft paper leaked gpt-4o param count but was just an estimate

swyx 🔜 @aiDotEngineer@swyx

uhhh

did Mustafa just leak the Mythos FLOP count??

was this public knowledge before, even if its an estimate i dont get what you gain out of this

21d7.1K271

davidad 🎇@davidad

@xlr8harder I think the specific 10^24 ops number originates with me

davidad 🎇@davidad

I fully agree. Roughly, this threshold should be when any single number has more than 10²⁴ ALU operations, or 10²⁷ logic gates, in its entire causal history.

GPT-3, AlphaFold 2, Stable Diffusion, LLaMa, Dromedary: below the line.

GPT-4, PaLM 2, Claude-Next: over the line.

20d2.4K340