AI · 5h ago

Rohan Anil, former Google DeepMind engineer, teases a new optimizer matching Shampoo's performance gains over AdamW

Story Overview

Rohan Anil says an optimizer now exists that improves on the Shampoo family (which he describes as "renamed to Muon and friends") by about the same margin that Shampoo improves on AdamW. He traces the idea to a 2024 meeting with Ilya Sutskever and does not name the optimizer. Separately, Keller Jordan disputes that Muon is a renamed Shampoo and reports public benchmark runs on modded-NanoGPT that place Muon ahead of both Shampoo and Adam, with the momentum-free variant (spectral descent) about 2x slower. The posts say nothing about behavior at frontier scale, and Anil makes no commitment to releasing a paper.

Original post

rohan anil@_arohan_ · #79 in AI

Ilya (sorry for name dropping) met with me in 2024 and said in a meeting that we can do better than Shampoo family (renamed to Muon and friends).

Now I can say this is very true, there exist an optimizer that shows the same scale of improvements that shampoo shows over adamw on deep learning models.

Fundamental optimization continues to drive progress. Fascinatingly the work to get there is all or nothing. Nothing works under everything works.

10:26 AM · Jun 8, 2026 · 65.6K Views

Open Question

Benchmark details stay limited to small models

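Keller Jordan’s public NanoGPT benchmark runs show Muon ahead, Shampoo (with its original 1/4 power) roughly halfway between Muon and Adam, and spectral descent about 2x slower. Spectral descent is the accumulation-free case, equivalent to both Muon with mu=0 and Shampoo with b1=b2=0, so the ordering suggests the gap comes from how each method accumulates gradient information across steps (momentum feeding Muon's Newton-Schulz orthogonalization, running statistics in Shampoo) rather than from the per-step update alone. How those orderings translate outside speedrun-scale experiments is still open.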
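The equivalence quoted in the thread can be checked numerically. The sketch below is an illustration under stated assumptions, not code from the benchmark: the function names are hypothetical, the gradient is assumed to be a full-rank square matrix, and "Shampoo with b1=b2=0" is read as building both preconditioners from the current gradient only. Under that reading, Shampoo's 1/4-power update reduces to the orthogonal factor U Vᵀ of the gradient's SVD, which is the spectral descent direction.

```python
import numpy as np

def inv_fourth_root(sym_psd, eps=1e-12):
    """Return sym_psd^(-1/4) via eigendecomposition (assumes a full-rank PSD input)."""
    eigvals, eigvecs = np.linalg.eigh(sym_psd)
    eigvals = np.maximum(eigvals, eps)  # guard against tiny negative round-off
    return (eigvecs * eigvals ** -0.25) @ eigvecs.T

def shampoo_no_accumulation(grad):
    """Shampoo-style update with statistics taken from the current gradient only.

    Assumption: this is what 'Shampoo(b1=b2=0)' means in the thread.
    """
    left = grad @ grad.T    # L = G G^T
    right = grad.T @ grad   # R = G^T G
    return inv_fourth_root(left) @ grad @ inv_fourth_root(right)

def spectral_descent_direction(grad):
    """Orthogonalized gradient U V^T from the reduced SVD G = U S V^T."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = rng.normal(size=(4, 4))
    gap = np.abs(shampoo_no_accumulation(g) - spectral_descent_direction(g)).max()
    print(f"max abs difference: {gap:.2e}")
```

The algebra behind the check: with G = U S Vᵀ, L^(-1/4) = U S^(-1/2) Uᵀ and R^(-1/4) = V S^(-1/2) Vᵀ, so the product collapses to U Vᵀ.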

Developer Impact

Training speed records already cite the method

Muon has appeared in recent NanoGPT and CIFAR-10 training records, suggesting some practitioners are already swapping it in where the benchmark numbers matter. Whether the same swap pays off in production pipelines or larger pretraining runs is not addressed in the current posts.
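For readers weighing such a swap, a Muon-style update for a single weight matrix can be sketched as momentum accumulation followed by orthogonalization. This is a simplified illustration, not the reference implementation: it uses the classical cubic Newton-Schulz iteration instead of the tuned polynomial used in speedrun code, omits Nesterov momentum and per-shape scaling, and the function names and default values of `lr`, `mu`, and `steps` are assumptions chosen for the example.

```python
import numpy as np

def newton_schulz_orthogonalize(matrix, steps=30):
    """Approximate the orthogonal polar factor with the cubic Newton-Schulz iteration.

    Each step maps every singular value s to 1.5*s - 0.5*s**3, which converges to 1
    for s in (0, sqrt(3)). Dividing by the Frobenius norm keeps all s <= 1.
    """
    x = matrix / (np.linalg.norm(matrix) + 1e-12)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def muon_style_step(weight, grad, momentum_buf, lr=0.02, mu=0.95):
    """One simplified Muon-style update; returns (new_weight, new_momentum_buf).

    With mu=0 the buffer equals the raw gradient, so the step reduces to
    spectral descent as described in the thread.
    """
    momentum_buf = mu * momentum_buf + grad
    update = newton_schulz_orthogonalize(momentum_buf)
    return weight - lr * update, momentum_buf

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 3))
    buf = np.zeros_like(w)
    for _ in range(3):
        fake_grad = rng.normal(size=w.shape)  # stand-in for a real backward pass
        w, buf = muon_style_step(w, fake_grad, buf)
    print(w.shape)
```

Setting `mu=0.0` in this sketch gives a quick way to reproduce the momentum-free baseline that the benchmark thread reports as slower.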

Sentiment

Many users welcomed the reported gains over AdamW and the Shampoo family as a shift toward practical optimization advances, while some criticized the announcement as vague.

Pos: 66.7%

Neg: 33.3%

10 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 4.4K · Bookmarks 2

rohan anil@_arohan_

All or nothing aspect means there is no incremental reward which just needs faith.

5h4.4K352

Likes 40 · Replies 4

rohan anil@_arohan_

Would LLMs have figured this out on their own? I am not so sure. The right prior is important.

5h4.2K402

Retweets 1

rohan anil@_arohan_

All you have to do is believe and listen to the gradients.

5h3.5K352

rohan anil@_arohan_

@swyx It will get rejected due to lack of novelty. :)

4h1.3K301

swyx@swyx

@_arohan_ oh man please publish a. paper sometime!!

4h2.7K171

rohan anil@_arohan_

@jaiselsingh Could be great near a good initialization

5h30161

Keller Jordan@kellerjordan0

@_arohan_ My response to the claim that Muon is a renamed version of Shampoo

Keller Jordan@kellerjordan0

I've added two optimizers to the public benchmark:

(1) Shampoo (with its original 1/4 power). (2) Spectral descent, which is equivalent to both Muon(mu=0) and Shampoo(b1=b2=0).

Result: Shampoo falls halfway between Muon & Adam; Spectral descent is ~2x slower.

Thread below 1/6

22m14331

bilal@bilaltwovec

postconditioning achieved internally 🫪

5h89580

jaisel@jaiselsingh

@_arohan_ what’s your take on zeroth-order optimization techniques? :>

5h3701

altyni 🪄@0xaltyni

@_arohan_ Do you think for different types of data (e.g. coding, literature, what not) there will be different Pareto improving optimizers?

5h388

rohan anil@_arohan_

@0xaltyni Data distribution definitely would have some impact on some of the choice/hypers of the optimizer.

5h3314

jaisel@jaiselsingh

@_arohan_ I’m curious abt a CFD like structure where you solve a coarse problem to find the flow structure, then refine where the gradients actually matter so zeroth order global search and then refinement

5h81

rohan anil@_arohan_

@a_karvonen I really did like parallax runs on it that was posted recently there.

4h853

Eitan Porat@PoratEitan

@_arohan_ Cool! so open source it.

5h24

swyx@swyx

@A_K_Nain @_arohan_ i will accept at @aidotengineer we are slutty like that

@vibhuuuus pls get for poaster sessions

2h1142

Aakash Kumar Nain@A_K_Nain

@_arohan_ @swyx Hahah ICLR/NeurIPS instant reject 😂😂😂🤣🤣

4h902

Adam Karvonen@a_karvonen

@_arohan_ wen core automation nanogpt speed run record?

4h7

Alex YGift@Radipdegen

@_arohan_ someone named a shampoo variant after Ilya saying it with ur own chest is crazy, respect

5h137

alyxya@_alyxya

@_arohan_ adamw had the wrong theory being element wise, muon is better especially with weight decay and other tricks, but I don't think an optimizer improvement alone fixes fundamental limitations in ml theory

5h130

Saurabh Bhatnagar@analyticsaurabh

@_arohan_ Congrats Looking forward to Conditioning properly

5h119