Muon creator Keller Jordan updates his Modded-NanoGPT benchmark, showing Shampoo outperforms Adam but trails Muon in training steps

Story Overview

Keller Jordan added Shampoo and Spectral descent runs to the public Modded-NanoGPT track, placing the new results on the same chart that already tracked Adam and his own Muon optimizer for the 124M model.

Original post

Keller Jordan@kellerjordan0

I've added two optimizers to the public benchmark:

(1) Shampoo (with its original 1/4 power).

(2) Spectral descent, which is equivalent to both Muon(mu=0) and Shampoo(b1=b2=0).

Result: Shampoo falls halfway between Muon & Adam; Spectral descent is ~2x slower.

Thread below 1/6

12:11 PM · Jun 8, 2026 · 25.4K Views
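
A short derivation of why the no-accumulation case collapses to a single update, under stated assumptions: the gradient $G$ has thin SVD $G = U \Sigma V^\top$, Shampoo with b1=b2=0 builds its preconditioners from the current gradient only, and inverse powers are taken as pseudo-inverses on the nonzero singular values. The symbols $L$, $R$, $U$, $\Sigma$, $V$ are notation chosen here, not taken from the post.

```latex
\begin{aligned}
L &= G G^\top = U \Sigma^2 U^\top, \qquad R = G^\top G = V \Sigma^2 V^\top, \\
L^{-1/4} \, G \, R^{-1/4}
  &= \left(U \Sigma^{-1/2} U^\top\right) \left(U \Sigma V^\top\right) \left(V \Sigma^{-1/2} V^\top\right) \\
  &= U \, \Sigma^{-1/2} \, \Sigma \, \Sigma^{-1/2} \, V^\top = U V^\top .
\end{aligned}
```

$U V^\top$ is the orthogonalized gradient. Muon with mu=0 orthogonalizes the raw gradient (approximately, via an iterative method), so under these assumptions all three updates point in the same direction, up to step-size scaling.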

FYI

Step Counts Reveal Clear Ordering

Muon reached the target validation loss in 3325 steps, Shampoo followed at 4100, Adam needed 4875, and Spectral descent required roughly 8225.
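
The ordering can be checked with a few lines of arithmetic. This is only a sketch over the four step counts quoted above; the variable names are illustrative, and the ratios say nothing about wall-clock time per step.

```python
# Step counts to reach the target validation loss, as quoted above.
steps = {"Muon": 3325, "Shampoo": 4100, "Adam": 4875, "Spectral descent": 8225}

# "Halfway between Muon & Adam": the midpoint of the two step counts.
midpoint = (steps["Muon"] + steps["Adam"]) / 2

# Step count of each optimizer relative to Muon.
ratios = {name: count / steps["Muon"] for name, count in steps.items()}

for name, count in sorted(steps.items(), key=lambda item: item[1]):
    print(f"{name:17s} {count:5d} steps  {ratios[name]:.2f}x Muon")
print(f"Muon/Adam midpoint: {midpoint:.0f} steps")
```

On these numbers the midpoint coincides with Shampoo's step count, and Spectral descent needs about 2.5 times as many steps as Muon, or about 2 times as many as Shampoo.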

Open Question

Links to Earlier Preconditioning Papers

Attached references include the original Shampoo arXiv and spectral descent work, while community replies note ongoing debate over how closely Muon matches certain Shampoo variants.

Sentiment

Commenters welcome the Shampoo runs on the Modded-NanoGPT benchmark as a useful optimizer comparison, and find the hyperparameter-debugging details satisfying.

Pos 100.0% · Neg 0.0% · 8 comments with sentiment.

Cluster Engagement

Posts from X

Keller Jordan@kellerjordan0

One motivation for me to add these optimizers was seeing @_arohan_, a senior researcher whose contributions I respect, repeatedly claim that Muon is Shampoo.

IIUC, his argument is that if we disable accumulation in both Muon and Shampoo, then they become... 4/6

Keller Jordan@kellerjordan0

For this comparison, I kept Shampoo's exponent at its original value of 1/4. It is well-known that the modern "Shampoo^2" variant which uses 1/2 is more efficient, but this modification breaks the mathematical relationship to Muon and Spectral Descent, so I kept the original. 3/6
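
A sketch of what changes with the larger exponent, for a single gradient. It assumes the gradient has thin SVD $G = U \Sigma V^\top$, that without accumulation the preconditioners are $L = G G^\top$ and $R = G^\top G$, and that the 1/2 variant applies $L^{-1/2}$ and $R^{-1/2}$ on the two sides. That reading of "Shampoo^2" is an assumption, not something the post spells out.

```latex
L^{-1/2} \, G \, R^{-1/2}
  = \left(U \Sigma^{-1} U^\top\right) \left(U \Sigma V^\top\right) \left(V \Sigma^{-1} V^\top\right)
  = U \, \Sigma^{-1} \, V^\top \;\neq\; U V^\top \quad \text{unless } \Sigma = I .
```

Under these assumptions the singular values are inverted rather than cancelled, so the update is no longer the orthogonalized gradient $U V^\top$ that Muon and spectral descent use.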

rohan anil@_arohan_

Here you go, sir. Muon is a good optimizer.

I think Keller's attempt at implementing it is great -- this primarily helped me look at why his effort produced a much worse-looking curve than what I could get.

This ended up being a nerdsnipe into hyperparameters, grafting, and eigh calls in the distributed shampoo package from Meta on my car ride home.

The main deltas are below:

Keller Jordan@kellerjordan0

...equivalent. This is correct. The problem is that Muon without accumulation is not Muon: It is Spectral Descent, which is >2x slower.

To go fast we need accumulation, and -- as shown in the figure -- the way it's added is what makes the difference between Muon and Shampoo. 5/6

Keller Jordan@kellerjordan0

Citations:

Shampoo is Gupta et al. (2018) https://arxiv.org/abs/1802.09568 and Anil et al. (2020) https://arxiv.org/abs/2002.09018

Spectral descent is Carlson et al. (2015a) https://proceedings.mlr.press/v38/carlson15.html and Carlson et al. (2015b) https://papers.nips.cc/paper_files/paper/2015/hash/f50a6c02a3fc5a3a5d4d9391f05f3efc-Abstract.html 6/6

Keller Jordan@kellerjordan0

Reproducible logs:

Shampoo: https://github.com/KellerJordan/modded-nanogpt/blob/master/records/track_3_optimization/results/20260513_shampoo_1_4_power/503575c5-6dde-425a-b461-2df4d99db974.txt

Spectral descent: https://github.com/KellerJordan/modded-nanogpt/blob/master/records/track_3_optimization/results/20260517_ortho/d5098d67-7c1b-47b4-8833-80960d633d33.txt

As part of the public benchmark, further hyperparameter improvements are welcomed for any of these runs. All four use the same WSD lr schedule. 2/6

Yaroslav Bulatov@yaroslavvb

I went to a few optimization talks at ICML 2018 and I'm also reminded of the GGT optimizer (same as Shampoo^2?). In my mind the biggest deviation of Muon is that it breaks with the "preconditioning" view. It's not preconditioning to normalize the gradient by the gradient in the same batch. The preconditioning direction is still open: how best to utilize correlation statistics from training history.

Konstantin Mishchenko@konstmish

@kellerjordan0 @_arohan_ Muon and Shampoo are indeed related but definitely not to the extent to call them the same.

rohan anil@_arohan_

First one is: don't take square roots of small eps; do a pseudo-inverse.

inv_power_L = L.pow(-1.0 / root)

to:

positive_eigenvalue_mask = L > 1e-15
inv_power_L = torch.zeros_like(L)
inv_power_L[positive_eigenvalue_mask] = L[positive_eigenvalue_mask].pow(-1.0 / root)
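
A minimal pure-Python sketch of the same idea, for readers without torch at hand. It assumes `L` holds the eigenvalues of a preconditioner matrix (as an `eigh` call would return them); the function names and the sample eigenvalues are hypothetical, and only the 1e-15 threshold comes from the post.

```python
def naive_inverse_root(eigenvalues, root):
    # Raises every eigenvalue to the power -1/root, including near-zero noise.
    return [ev ** (-1.0 / root) for ev in eigenvalues]


def masked_inverse_root(eigenvalues, root, threshold=1e-15):
    # Pseudo-inverse style: eigenvalues at or below the threshold map to 0
    # instead of being raised to a negative power.
    return [ev ** (-1.0 / root) if ev > threshold else 0.0 for ev in eigenvalues]


# Hypothetical spectrum: two real directions plus one numerical-noise eigenvalue.
sample = [16.0, 1.0, 1e-20]
naive = naive_inverse_root(sample, root=4)
masked = masked_inverse_root(sample, root=4)
print("naive: ", naive)
print("masked:", masked)
```

In the naive version the noise eigenvalue turns into a factor of roughly 1e5, which would dominate the preconditioned update; the masked version drops that direction entirely.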

rohan anil@_arohan_

Second one is that you do Nesterov momentum from Ilya/Marten/Hinton’s paper on importance of momentum from 2011 and pass that into shampoo.

Rohan Pandey@khoomeik

x dot com in 2026 is miles ahead of slopmaxxed academic peer review culture

feels like im back in the 17th century watching newton & leibniz argue via public letters

Ethan@torchcompiled

@kellerjordan0 @jeffreycider @_arohan_ This paper suggests Muon is performing the input side whitening of shampoo, so like “Left-side-only-shampoo”

https://arxiv.org/abs/2604.01472

mikail@Gradientdinner

@kellerjordan0 How about this SOAP-ified Muon/ Muon-ified SOAP: https://github.com/NVIDIA-NeMo/Emerging-Optimizers/blob/main/emerging_optimizers/soap/moso.py

rohan anil@_arohan_

Third one is that for an [m, n] tensor, take the side that's larger; this is especially good because otherwise there is some numerical instability, since it's low rank, plus indefiniteness creeps in. Didn't spend too much time.

rohan anil@_arohan_

Last one is that Adam grafting from Meta's impl means the size of the update is O(sqrt(size)), for which you have to set a different lr and weight decay. The Muon implementation uses different lr / wd for various layers. I just used it, and rescaled it as appropriate.
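
One way to read the O(sqrt(size)) remark, sketched as arithmetic. The assumptions here are not from the thread: an Adam-like update is modeled as having entries of magnitude about 1, an orthogonalized update as having all singular values equal to 1, and the 768 x 3072 shape is a hypothetical layer.

```python
import math


def adam_like_norm(rows, cols):
    # Frobenius norm of a matrix whose rows * cols entries all have magnitude ~1.
    return math.sqrt(rows * cols)


def orthogonal_norm(rows, cols):
    # Frobenius norm of a semi-orthogonal matrix: min(rows, cols) unit singular values.
    return math.sqrt(min(rows, cols))


rows, cols = 768, 3072  # hypothetical weight-matrix shape
ratio = adam_like_norm(rows, cols) / orthogonal_norm(rows, cols)
print(f"Adam-like norm:  {adam_like_norm(rows, cols):.1f}")
print(f"orthogonal norm: {orthogonal_norm(rows, cols):.1f}")
print(f"ratio:           {ratio:.1f}")
```

Under this model the ratio equals sqrt(max(rows, cols)), so a learning rate and weight decay tuned for one update scale would need rescaling by a shape-dependent factor before being reused with the other.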

rohan anil@_arohan_

To be clear, it’s not a new record or anything; this is all noise.

Keller Jordan@kellerjordan0

@_arohan_ Thanks for this result. Can you please provide us with the reproducible logfile generated by the run? I would like to understand the algorithm you used in full detail

rohan anil@_arohan_

@kellerjordan0 Yes! I will send a patch. The comments should be good to recreate it if you want it asap. The only thing that's unspecified was lr and wd, but you can calculate that based on the original lr, and the ratio of sqrt(size)/sqrt(row/col) thing. Just make sure you update weight decay.

rohan anil@_arohan_

Finally, pretty excited to produce some kick ass kernels for these in the future, so we don’t need to be burning gpus doing bad linear algebra operations.

rohan anil@_arohan_

@plugyawn I have mentioned these often in talks.

Everyone comes complaining Shampoo doesn’t work.
