Rohan Anil, former Google DeepMind engineer, teases a new optimizer matching Shampoo's performance gains over AdamW

Story Overview

Rohan Anil says there is now an optimizer that shows improvements on the same scale as those Shampoo shows over AdamW, and traces the idea to a 2024 meeting in which Ilya Sutskever told him it was possible to do better than the Shampoo family, which Anil describes as "renamed to Muon and friends." He does not name the optimizer. Separately, Keller Jordan's public benchmark runs on modded-NanoGPT place Muon ahead of both Shampoo and Adam, with the edge attributed to combining momentum with orthogonalization rather than to the simpler spectral descent variant. The posts say nothing about behavior at frontier scale, and no paper release has been announced.

Original post
rohan anil@_arohan_ · #79 in AI

Ilya (sorry for name dropping) met with me in 2024 and said in a meeting that we can do better than Shampoo family (renamed to Muon and friends).

Now I can say this is very true, there exist an optimizer that shows the same scale of improvements that shampoo shows over adamw on deep learning models.

Fundamental optimization continues to drive progress. Fascinatingly the work to get there is all or nothing. Nothing works under everything works.

10:26 AM · Jun 8, 2026 · 65.6K Views
Open Question

Benchmark details stay limited to small models

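Keller Jordan’s public NanoGPT runs show Muon ahead, Shampoo (with its original 1/4 power) roughly midway between Muon and Adam, and spectral descent about 2x slower. Spectral descent is what both methods reduce to when accumulation across steps is turned off (Muon with mu=0, Shampoo with b1=b2=0), so the gap is attributed to accumulation combined with the Newton-Schulz orthogonalization steps rather than to the orthogonalized per-step update alone. How those orderings translate outside speedrun-scale experiments is still open.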
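
The relationship between these three updates can be illustrated with a minimal NumPy sketch. It is not taken from the benchmark code. All function names and default values (lr, mu, steps, tol) are hypothetical. The orthogonalization uses the textbook cubic Newton-Schulz iteration, not the tuned coefficients found in public Muon implementations. The Shampoo-style function builds its statistics from the current gradient only and omits the damping and other details of real Shampoo implementations.

```python
import numpy as np


def newton_schulz_orthogonalize(g, steps=30, eps=1e-7):
    """Approximate U @ V.T for g = U @ diag(s) @ V.T (textbook cubic iteration)."""
    # Dividing by the Frobenius norm makes every singular value at most 1,
    # which is inside the region where the cubic iteration converges.
    x = g / (np.linalg.norm(g) + eps)
    tall = x.shape[0] > x.shape[1]
    if tall:
        x = x.T  # iterate on the wide orientation so x @ x.T is the smaller Gram matrix
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x  # pushes each nonzero singular value toward 1
    return x.T if tall else x


def muon_like_step(weight, grad, momentum_buf, lr=0.02, mu=0.95, steps=30):
    """One simplified update: accumulate momentum, then orthogonalize it."""
    momentum_buf = mu * momentum_buf + grad
    update = newton_schulz_orthogonalize(momentum_buf, steps)
    return weight - lr * update, momentum_buf


def inverse_fourth_root(sym, tol=1e-10):
    """Pseudo-inverse 1/4 power of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(sym)
    # Eigenvalues at or below tol are treated as zero (assumed threshold).
    powers = np.where(vals > tol, np.maximum(vals, tol) ** -0.25, 0.0)
    return (vecs * powers) @ vecs.T


def shampoo_direction_no_accumulation(grad):
    """Shampoo-style direction L^(-1/4) G R^(-1/4) using only the current gradient."""
    left = inverse_fourth_root(grad @ grad.T)
    right = inverse_fourth_root(grad.T @ grad)
    return left @ grad @ right
```

Under these assumptions, calling muon_like_step with mu=0.0 discards the accumulated momentum, and for a full-rank gradient the resulting direction agrees with shampoo_direction_no_accumulation up to numerical tolerance: both reduce to the orthogonal factor U V^T of the gradient, which is the spectral descent direction.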

Developer Impact

Training speed records already cite the method

Muon has appeared in recent NanoGPT and CIFAR-10 training records, suggesting some practitioners are already swapping it in where the benchmark numbers matter. Whether the same swap pays off in production pipelines or larger pretraining runs is not addressed in the current posts.

Sentiment

Many users greeted the optimizer's reported gains over AdamW and the Shampoo family as a shift toward practical optimization advances, while some criticized the announcement as vague posting.

Positive: 66.7% · Negative: 33.3% (10 comments with sentiment)
Cluster Engagement
Posts from X
rohan anil@_arohan_

All or nothing aspect means there is no incremental reward which just needs faith.

5h · 4.4K Views · 35 Likes · 2 Bookmarks
rohan anil@_arohan_

Would LLMs have figured this out on their own? I am not so sure. The right prior is important.

5h · 4.2K Views · 40 Likes · 2 Bookmarks
rohan anil@_arohan_

All you have to do is believe and listen to the gradients.

5h · 3.5K Views · 35 Likes · 2 Bookmarks
rohan anil@_arohan_

@swyx It will get rejected due to lack of novelty. :)

4h · 1.3K Views · 30 Likes · 1 Bookmark
swyx@swyx

@_arohan_ oh man please publish a. paper sometime!!

4h · 2.7K Views · 17 Likes · 1 Bookmark
rohan anil@_arohan_

@jaiselsingh Could be great near a good initialization

5h · 301 Views · 6 Likes · 1 Bookmark
Keller Jordan@kellerjordan0

@_arohan_ My response to the claim that Muon is a renamed version of Shampoo

Keller Jordan@kellerjordan0

I've added two optimizers to the public benchmark:

(1) Shampoo (with its original 1/4 power). (2) Spectral descent, which is equivalent to both Muon(mu=0) and Shampoo(b1=b2=0).

Result: Shampoo falls halfway between Muon & Adam; Spectral descent is ~2x slower.

Thread below 1/6

22m · 143 Views · 3 Likes · 1 Bookmark
bilal@bilaltwovec

postconditioning achieved internally 🫪

5h · 895 Views · 8 Likes · 0 Bookmarks
jaisel@jaiselsingh

@_arohan_ what’s your take on zeroth-order optimization techniques? :>

5h · 370 Views · 1 Like
altyni 🪄@0xaltyni

@_arohan_ Do you think for different types of data (e.g. coding, literature, what not) there will be different Pareto improving optimizers?

5h · 388 Views
rohan anil@_arohan_

@0xaltyni Data distribution definitely would have some impact on some of the choice/hypers of the optimizer.

5h · 331 Views · 4 Likes
jaisel@jaiselsingh

@_arohan_ I’m curious abt a CFD like structure where you solve a coarse problem to find the flow structure, then refine where the gradients actually matter so zeroth order global search and then refinement

5h · 81 Views
rohan anil@_arohan_

@a_karvonen I really did like parallax runs on it that was posted recently there.

4h · 85 Views · 3 Likes
Eitan Porat@PoratEitan

@_arohan_ Cool! so open source it.

5h · 24 Views
swyx@swyx

@A_K_Nain @_arohan_ i will accept at @aidotengineer we are slutty like that

@vibhuuuus pls get for poaster sessions

2h · 114 Views · 2 Likes

@_arohan_ @swyx Hahah ICLR/NeurIPS instant reject 😂😂😂🤣🤣

4h · 90 Views · 2 Likes
Adam Karvonen@a_karvonen

@_arohan_ wen core automation nanogpt speed run record?

4h · 7 Views
Alex YGift@Radipdegen

@_arohan_ someone named a shampoo variant after Ilya saying it with ur own chest is crazy, respect

5h · 137 Views
alyxya@_alyxya

@_arohan_ adamw had the wrong theory being element wise, muon is better especially with weight decay and other tricks, but I don't think an optimizer improvement alone fixes fundamental limitations in ml theory

5h · 130 Views
Saurabh Bhatnagar@analyticsaurabh

@_arohan_ Congrats Looking forward to Conditioning properly

5h · 119 Views