
Muon creator Keller Jordan updates his Modded-NanoGPT benchmark, showing Shampoo outperforms Adam but trails Muon in training steps

Story Overview

Keller Jordan added Shampoo and Spectral descent runs to the public Modded-NanoGPT track, placing the new results on the same chart that already tracked Adam and his own Muon optimizer for the 124M model.

Original post
Keller Jordan@kellerjordan0

I've added two optimizers to the public benchmark:

(1) Shampoo (with its original 1/4 power).
(2) Spectral descent, which is equivalent to both Muon(mu=0) and Shampoo(b1=b2=0).

Result: Shampoo falls halfway between Muon & Adam; Spectral descent is ~2x slower.

Thread below 1/6

12:11 PM · Jun 8, 2026 · 25.4K Views
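
A short derivation, offered as a sketch rather than as the author's own argument, shows why both reductions named in the post collapse to the same update. It assumes a single gradient matrix G with compact SVD G = U Σ Vᵀ (nonzero singular values only), Shampoo statistics built from that one gradient with no accumulation, matrix powers taken as pseudo-inverse powers, and an exact orthogonalization step in Muon (in practice Muon approximates that step iteratively).

```latex
\begin{aligned}
G &= U \Sigma V^\top, \qquad
L = G G^\top = U \Sigma^{2} U^\top, \qquad
R = G^\top G = V \Sigma^{2} V^\top \\[4pt]
\text{Shampoo}(\beta_1 = \beta_2 = 0):\quad
L^{-1/4}\, G\, R^{-1/4}
  &= \bigl(U \Sigma^{-1/2} U^\top\bigr)\bigl(U \Sigma V^\top\bigr)\bigl(V \Sigma^{-1/2} V^\top\bigr)
   = U V^\top \\[4pt]
\text{Muon}(\mu = 0):\quad
\operatorname{orth}(G) &= U V^\top
\end{aligned}
```

In both cases every nonzero singular value of the gradient is replaced by 1, which matches the spectral descent direction up to a scalar step size.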
FYI

Step Counts Reveal Clear Ordering

Muon reached the target validation loss in 3325 steps, Shampoo followed at 4100, Adam needed 4875, and Spectral descent required roughly 8225.
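
The ordering can be checked with a few lines of arithmetic. The sketch below uses only the four step counts quoted above; the function name and the choice of Muon as the baseline are illustrative.

```python
# Steps to reach the target validation loss, as quoted above.
steps = {"Muon": 3325, "Shampoo": 4100, "Adam": 4875, "Spectral descent": 8225}


def relative_steps(step_counts, baseline="Muon"):
    """Return each optimizer's step count as a multiple of the baseline's."""
    base = step_counts[baseline]
    return {name: count / base for name, count in step_counts.items()}


ratios = relative_steps(steps)
# Arithmetic midpoint between the Muon and Adam step counts.
midpoint = (steps["Muon"] + steps["Adam"]) / 2

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:>16}: {steps[name]:>5} steps, {ratio:.2f}x Muon")
print(f"Muon/Adam midpoint: {midpoint:.0f} steps")
```

The midpoint of 3325 and 4875 is exactly 4100, Shampoo's step count, which is consistent with the "halfway" description. Spectral descent's 8225 steps are about 2.5x Muon's count and about 2.0x Shampoo's.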

Open Question

Links to Earlier Preconditioning Papers

Attached references include the original Shampoo arXiv and spectral descent work, while community replies note ongoing debate over how closely Muon matches certain Shampoo variants.

Sentiment

Users appreciate the Shampoo results on the NanoGPT benchmark for delivering a useful optimizer comparison and satisfying hyperparameter-debugging insights.

Positive: 100.0%
Negative: 0.0%
Based on 8 comments with sentiment.
Posts from X
Keller Jordan@kellerjordan0

One motivation for me to add these optimizers was seeing @_arohan_, a senior researcher whose contributions I respect, repeatedly claim that Muon is Shampoo.

IIUC, his argument is that if we disable accumulation in both Muon and Shampoo, then they become... 4/6

11h · 19.2K Views · 114 Likes · 46 Bookmarks
rohan anil@_arohan_

Here you go, sir. Muon is a good optimizer.

I think Keller's attempt at implementing it is great -- this primarily helped me look at why his effort produced a much worse-looking curve than what I could get.

This ended up being a nerdsnipe into hyperparameters, grafting, and eigh calls in the Distributed Shampoo package from Meta on my car ride home.

The main deltas are below:

1h · 6.9K Views · 89 Likes · 55 Bookmarks
Keller Jordan@kellerjordan0

...equivalent. This is correct. The problem is that Muon without accumulation is not Muon: It is Spectral Descent, which is >2x slower.

To go fast we need accumulation, and -- as shown in the figure -- the way it's added is what makes the difference between Muon and Shampoo. 5/6

11h · 14.5K Views · 119 Likes · 33 Bookmarks
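
To make "the way it's added" concrete, the two accumulation schemes can be written side by side. These are common textbook forms, stated here as assumptions: the benchmark's implementations may differ in details such as Nesterov momentum, bias correction, or whether the Shampoo statistics are running sums (as in the original paper) or exponential moving averages (as written below).

```latex
\begin{aligned}
\textbf{Muon:}\quad
& M_t = \mu\, M_{t-1} + G_t,
&& \Delta W_t \propto \operatorname{orth}(M_t) \\[4pt]
\textbf{Shampoo:}\quad
& L_t = \beta_2\, L_{t-1} + (1-\beta_2)\, G_t G_t^\top, \quad
  R_t = \beta_2\, R_{t-1} + (1-\beta_2)\, G_t^\top G_t, \\
& M_t = \beta_1\, M_{t-1} + (1-\beta_1)\, G_t,
&& \Delta W_t \propto L_t^{-1/4}\, M_t\, R_t^{-1/4}
\end{aligned}
```

Muon accumulates gradients first and then orthogonalizes the accumulated matrix, while Shampoo accumulates second-moment statistics and uses them to precondition. With all accumulation switched off, both expressions depend only on the current gradient G_t.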
Keller Jordan@kellerjordan0

Citations:

Shampoo is Gupta et al. (2018) https://arxiv.org/abs/1802.09568 and Anil et al. (2020) https://arxiv.org/abs/2002.09018

Spectral descent is Carlson et al. (2015a) https://proceedings.mlr.press/v38/carlson15.html and Carlson et al. (2015b) https://papers.nips.cc/paper_files/paper/2015/hash/f50a6c02a3fc5a3a5d4d9391f05f3efc-Abstract.html 6/6

11h · 3.7K Views · 51 Likes · 12 Bookmarks
Keller Jordan@kellerjordan0

Reproducible logs:

Shampoo: https://github.com/KellerJordan/modded-nanogpt/blob/master/records/track_3_optimization/results/20260513_shampoo_1_4_power/503575c5-6dde-425a-b461-2df4d99db974.txt
Spectral descent: https://github.com/KellerJordan/modded-nanogpt/blob/master/records/track_3_optimization/results/20260517_ortho/d5098d67-7c1b-47b4-8833-80960d633d33.txt

As part of the public benchmark, further hyperparameter improvements are welcomed for any of these runs. All four use the same WSD lr schedule. 2/6

11h · 3.5K Views · 39 Likes · 7 Bookmarks
Yaroslav Bulatov@yaroslavvb

I went to a few optimization talks at ICML 2018, and I'm also reminded of the GGT optimizer (same as Shampoo^2?). In my mind, the biggest deviation of Muon is that it breaks with the "preconditioning" view. It's not preconditioning to normalize the gradient by the gradient in the same batch. The preconditioning direction is still open: how best to utilize correlation statistics from training history.

7h · 1.1K Views · 14 Likes · 7 Bookmarks
Keller Jordan@kellerjordan0

For this comparison, I kept Shampoo's exponent at its original value of 1/4. It is well-known that the modern "Shampoo^2" variant which uses 1/2 is more efficient, but this modification breaks the mathematical relationship to Muon and Spectral Descent, so I kept the original. 3/6

11h · 3.4K Views · 34 Likes · 3 Bookmarks
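
One way to see why the exponent matters is to repeat the no-accumulation calculation with 1/2 in place of 1/4. This is a sketch under stated assumptions: a single gradient G with compact SVD G = U Σ Vᵀ, statistics L = G Gᵀ and R = Gᵀ G with no accumulation, and the exponent applied to both factors. The thread does not spell out the "Shampoo^2" variant beyond its exponent.

```latex
\begin{aligned}
\text{exponent } \tfrac14:\quad
L^{-1/4}\, G\, R^{-1/4}
  &= U \Sigma^{-1/2}\, \Sigma\, \Sigma^{-1/2} V^\top = U V^\top \\[4pt]
\text{exponent } \tfrac12:\quad
L^{-1/2}\, G\, R^{-1/2}
  &= U \Sigma^{-1}\, \Sigma\, \Sigma^{-1} V^\top = U \Sigma^{-1} V^\top
\end{aligned}
```

With 1/4 the singular values cancel to 1, giving the orthogonalized update shared with Muon and spectral descent. With 1/2 they are inverted instead, so the result is no longer U Vᵀ unless all singular values equal 1.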
rohan anil@_arohan_

First one is: don't take square roots of small eps; do a pseudo-inverse.

inv_power_L = L.pow(-1.0 / root)

to:

positive_eigenvalue_mask = L > 1e-15
inv_power_L = torch.zeros_like(L)
inv_power_L[positive_eigenvalue_mask] = L[positive_eigenvalue_mask].pow(-1.0 / root)

1h · 1.4K Views · 24 Likes · 5 Bookmarks
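
The effect of the masked version can be illustrated without PyTorch. The sketch below is a plain-Python stand-in that works on a list of eigenvalues instead of a tensor; the function names and sample eigenvalues are hypothetical, and only the 1e-15 threshold and the -1.0 / root exponent come from the post.

```python
def naive_inverse_power(eigenvalues, root):
    """Raise every eigenvalue to the power -1/root, with no safeguard."""
    return [ev ** (-1.0 / root) for ev in eigenvalues]


def masked_inverse_power(eigenvalues, root, threshold=1e-15):
    """Pseudo-inverse root: invert eigenvalues above the threshold, zero the rest."""
    return [ev ** (-1.0 / root) if ev > threshold else 0.0 for ev in eigenvalues]


# Two well-conditioned eigenvalues and one that is numerically zero.
eigs = [16.0, 1.0, 1e-30]

naive = naive_inverse_power(eigs, root=4)
masked = masked_inverse_power(eigs, root=4)

print("naive: ", naive)   # the tiny eigenvalue is amplified to a huge factor
print("masked:", masked)  # the tiny eigenvalue contributes nothing
```

In the naive form, a near-zero eigenvalue turns into a multiplier in the tens of millions, so directions that carry almost no signal dominate the update. The masked form drops those directions, which is what a pseudo-inverse does.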
rohan anil@_arohan_

Second one is that you do Nesterov momentum from Ilya/Marten/Hinton’s paper on importance of momentum from 2011 and pass that into shampoo.

1h · 854 Views · 18 Likes · 2 Bookmarks
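
A minimal sketch of what "pass that into Shampoo" could look like, assuming the common formulation in which the momentum buffer is updated as buf = mu * buf + grad and the Nesterov lookahead is grad + mu * buf. The function name, the flat-list representation, and the value of mu are hypothetical; the post does not say how the momentum was wired in.

```python
def nesterov_gradient(grad, buf, mu=0.9):
    """One Nesterov-momentum step on flat lists of floats.

    Returns (lookahead, new_buf). The lookahead is what a preconditioner
    such as Shampoo would receive in place of the raw gradient.
    """
    new_buf = [mu * b + g for b, g in zip(buf, grad)]
    lookahead = [g + mu * b for g, b in zip(grad, new_buf)]
    return lookahead, new_buf


# Feed the same gradient twice to watch the buffer build up.
buf = [0.0, 0.0]
for step_grad in ([1.0, -2.0], [1.0, -2.0]):
    lookahead, buf = nesterov_gradient(step_grad, buf, mu=0.5)

print("buffer:   ", buf)
print("lookahead:", lookahead)
```

The preconditioner then sees a smoothed, slightly extrapolated gradient rather than the raw per-step gradient.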
Rohan Pandey@khoomeik

x dot com in 2026 is miles ahead of slopmaxxed academic peer review culture

feels like im back in the 17th century watching newton & leibniz argue via public letters

24m · 597 Views · 12 Likes · 3 Bookmarks
Ethan@torchcompiled

@kellerjordan0 @jeffreycider @_arohan_ This paper suggests Muon is performing the input side whitening of shampoo, so like “Left-side-only-shampoo”

https://arxiv.org/abs/2604.01472

7h · 1.8K Views · 8 Likes · 3 Bookmarks
mikail@Gradientdinner

@kellerjordan0 How about this SOAP-ified Muon/ Muon-ified SOAP: https://github.com/NVIDIA-NeMo/Emerging-Optimizers/blob/main/emerging_optimizers/soap/moso.py

9h · 304 Views · 10 Likes · 1 Bookmark
rohan anil@_arohan_

Third one is that for an [m, n] tensor, take the side that's larger; this is especially good because otherwise there is some numerical instability, since it's low rank, plus indefiniteness creeps in. Didn't spend too much time.

59m · 414 Views · 9 Likes
rohan anil@_arohan_

Last one is that Adam grafting from Meta’s impl means the size of the update is O(sqrt(size)), so you have to set a different lr and weight decay. The Muon implementation uses different lr / wd for various layers. I just used it, and rescaled it as appropriate.

58m · 760 Views · 12 Likes · 0 Bookmarks

@kellerjordan0 @_arohan_ Muon and Shampoo are indeed related but definitely not to the extent to call them the same.

9h · 1K Views · 3 Likes · 1 Bookmark
rohan anil@_arohan_

To be clear, it’s not a new record or anything; this is all noise.

37m · 490 Views · 2 Likes · 1 Bookmark
Keller Jordan@kellerjordan0

@_arohan_ Thanks for this result. Can you please provide us with the reproducible logfile generated by the run? I would like to understand the algorithm you used in full detail

34m · 423 Views · 5 Likes · 0 Bookmarks
rohan anil@_arohan_

@kellerjordan0 Yes! I will send a patch. The comments should be good enough to recreate it if you want it ASAP. The only thing that's unspecified is lr and wd, but you can calculate that based on the original lr and the ratio of sqrt(size)/sqrt(row/col). Just make sure you update weight decay.

31m · 394 Views · 5 Likes · 0 Bookmarks
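
The "O(sqrt(size))" remark can be read as a statement about update norms: if grafting borrows the norm of an Adam step, and each entry of that step has magnitude near 1, then the Frobenius norm of an m x n update is about sqrt(m * n). The sketch below only evaluates that norm and the sqrt(size)/sqrt(row/col) ratio exactly as written in the reply. The function names and the example layer shape are hypothetical, and how the ratio should be applied to lr and weight decay is not specified in the thread.

```python
import math


def adam_like_update_norm(rows, cols):
    """Frobenius norm of a rows x cols update whose entries all have magnitude 1."""
    return math.sqrt(rows * cols)


def quoted_ratio(rows, cols):
    """The sqrt(size) / sqrt(row/col) ratio, read literally from the reply."""
    return math.sqrt(rows * cols) / math.sqrt(rows / cols)


# Hypothetical layer shape, chosen only for illustration.
rows, cols = 3072, 768

print("Adam-like update norm:", adam_like_update_norm(rows, cols))
print("quoted ratio:         ", quoted_ratio(rows, cols))
```

Algebraically the quoted ratio simplifies to the column count, so under this reading the rescaling factor differs from layer to layer, which fits the remark that lr and weight decay both need adjusting.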
rohan anil@_arohan_

Finally, pretty excited to produce some kick ass kernels for these in the future, so we don’t need to be burning gpus doing bad linear algebra operations.

57m · 449 Views · 7 Likes
rohan anil@_arohan_

@plugyawn I have mentioned these often in talks.

Everyone comes complaining Shampoo doesn’t work.

42m · 136 Views · 3 Likes