Keller Jordan, Muon optimizer creator, updates Modded-NanoGPT benchmark, showing Muon and Nesterov-boosted Shampoo beat standard Adam

AI Judge changed title after evaluation, original title: "Muon creator Keller Jordan updates his Modded-NanoGPT benchmark, showing Shampoo outperforms Adam but trails Muon in training steps"

Story Overview

Keller Jordan added Shampoo and Spectral descent runs to the public Modded-NanoGPT track, placing the new results on the same chart that already tracked Adam and his own Muon optimizer for the 124M model.

Original post

Keller Jordan@kellerjordan0#453inTech

I've added two optimizers to the public benchmark:

(1) Shampoo (with its original 1/4 power). (2) Spectral descent, which is equivalent to both Muon(mu=0) and Shampoo(b1=b2=0).

Result: Shampoo falls halfway between Muon & Adam; Spectral descent is ~2x slower.

Thread below 1/6

12:11 PM · Jun 8, 2026 · 58K Views

FYI

Step Counts Reveal Clear Ordering

Muon reached the target validation loss in 3325 steps, Shampoo followed at 4100, Adam needed 4875, and Spectral descent required roughly 8225.
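The ordering can also be read as relative step counts. The short calculation below uses only the four figures quoted above; the variable names are illustrative.

```python
# Steps to reach the target validation loss, as quoted above.
steps = {"Muon": 3325, "Shampoo": 4100, "Adam": 4875, "Spectral descent": 8225}

# Step count of each optimizer, normalized to Muon.
relative_to_muon = {name: n / steps["Muon"] for name, n in steps.items()}

# "Halfway between Muon & Adam": midpoint of the two step counts.
midpoint = (steps["Muon"] + steps["Adam"]) / 2

# Spectral descent measured against Shampoo.
spectral_vs_shampoo = steps["Spectral descent"] / steps["Shampoo"]

for name, ratio in relative_to_muon.items():
    print(f"{name}: {ratio:.2f}x Muon's step count")
print(f"Muon/Adam midpoint: {midpoint:.0f} steps")
print(f"Spectral descent vs Shampoo: {spectral_vs_shampoo:.2f}x")
```

The midpoint of the Muon and Adam counts is exactly Shampoo's 4100 steps, and Spectral descent needs about 2.0x the steps of Shampoo and about 2.5x the steps of Muon.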

Open Question

Links to Earlier Preconditioning Papers

Attached references include the original Shampoo arXiv and spectral descent work, while community replies note ongoing debate over how closely Muon matches certain Shampoo variants.

Sentiment

Users praised the NanoGPT benchmark threads on Shampoo and Muon optimizers because the public code diffs, hyperparameter details, and loss-curve analysis made the distinctions concrete and practically useful.

Pos: 94.5% · Neg: 5.5%

17 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 117.1K · Bookmarks 336 · Likes 424 · Retweets 38 · Replies 21

rohan anil@_arohan_

Here you go, sir. Muon is a good optimizer.

I think Keller's attempt at implementing it is great -- this primarily helped me look at why his effort produced much worse looking curve than what I could get.

This ended up being a nerdsnipe into hyper parameters, grafting, and eigh calls in the distributed shampoo package from Meta in my car ride home.

The main deltas are below:

Rohan Pandey@khoomeik

x dot com in 2026 is miles ahead of slopmaxxed academic peer review culture

feels like im back in the 17th century watching newton & leibniz argue via public letters

Keller Jordan@kellerjordan0

Thank you for this result! Here's one initial correction:

In this post, Rohan states that I made an attempt to implement Shampoo, and labels my Shampoo training run `Buggy shampoo impl`.

But this is incorrect: I did not implement my own version of Shampoo.

Instead -- as can be seen in the reproducible log I provided in the original post -- I used, out-of-the-box, the official DistributedShampoo implementation provided by Facebook research. This is the most commonly-used Shampoo implementation that I could find on the internet.

The extent of my use of this implementation can be seen in the few lines of code below. If there are indeed bugs in this implementation, I can safely say that they were not created by me.

The remaining mystery, for anyone interested, might naturally become something like the following - How did Rohan today achieve a significantly better result compared to the official 2022-era DistributedShampoo implementation?

I am grateful for the comments he has already made regarding the deltas between his version and the 2022 one, and I am looking forward to things becoming fully precise/detailed soon once he releases the reproducible logfiles generated by his runs.

Keller Jordan@kellerjordan0

One motivation for me to add these optimizers was seeing @_arohan_, a senior researcher whose contributions I respect, repeatedly claim that Muon is Shampoo.

IIUC, his argument is that if we disable accumulation in both Muon and Shampoo, then they become... 4/6

Keller Jordan@kellerjordan0

For this comparison, I kept Shampoo's exponent at its original value of 1/4. It is well-known that the modern "Shampoo^2" variant which uses 1/2 is more efficient, but this modification breaks the mathematical relationship to Muon and Spectral Descent, so I kept the original. 3/6
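Why the exponent matters can be seen from the singular value decomposition. The derivation below is a sketch, not taken from the thread. It assumes accumulation is disabled, so the Shampoo statistics are just $GG^\top$ and $G^\top G$ of the current gradient $G$; that $G = U \Sigma V^\top$ is a thin SVD with inverse powers taken on the nonzero spectrum; and that the "Shampoo^2" variant applies the exponent 1/2 on both sides.

```latex
L = G G^\top = U \Sigma^{2} U^\top, \qquad R = G^\top G = V \Sigma^{2} V^\top

L^{-1/4}\, G\, R^{-1/4} = U \Sigma^{-1/2}\, \Sigma\, \Sigma^{-1/2} V^\top = U V^\top

L^{-1/2}\, G\, R^{-1/2} = U \Sigma^{-1}\, \Sigma\, \Sigma^{-1} V^\top = U \Sigma^{-1} V^\top
```

Under these assumptions the 1/4 power reduces to the orthogonalized gradient $U V^\top$, while the 1/2 power leaves a factor of $\Sigma^{-1}$ and so no longer matches it.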

Keller Jordan@kellerjordan0

...equivalent. This is correct. The problem is that Muon without accumulation is not Muon: It is Spectral Descent, which is >2x slower.

To go fast we need accumulation, and -- as shown in the figure -- the way it's added is what makes the difference between Muon and Shampoo. 5/6
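The equivalence without accumulation can be checked numerically. The sketch below is an illustration in NumPy, not code from the benchmark: it treats spectral descent as the orthogonalized gradient (up to a scalar step size) and Shampoo without accumulation as preconditioning with the 1/4 inverse powers of the current gradient's own statistics. All function names are hypothetical.

```python
import numpy as np

def orthogonalize(G):
    """Orthogonalized gradient: replace every singular value of G with 1."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def inv_root(M, root, tol=1e-10):
    """Pseudo-inverse root M^(-1/root) of a symmetric PSD matrix via eigh."""
    w, Q = np.linalg.eigh(M)
    inv = np.zeros_like(w)
    keep = w > tol * w.max()          # drop numerically zero eigenvalues
    inv[keep] = w[keep] ** (-1.0 / root)
    return (Q * inv) @ Q.T            # Q diag(inv) Q^T

def shampoo_no_accumulation(G, root=4):
    """Shampoo-style update whose statistics come from the current G only."""
    return inv_root(G @ G.T, root) @ G @ inv_root(G.T @ G, root)

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))
gap = np.abs(shampoo_no_accumulation(G) - orthogonalize(G)).max()
print(f"max abs difference: {gap:.2e}")
```

The difference is at the level of floating-point round-off, which is consistent with the statement that the two methods coincide once accumulation is removed.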

rohan anil@_arohan_

This is like a damping vs pseudo inverse question in optimization.

Pseudoinverse is a hard spectral cutoff; Tikhonov/damping is a soft spectral filter.

Does this imply batch size is not large enough for these stronger optimization to start mattering?

This would just mean one should make a larger batch speed run

* some related work

https://ww3.math.ucla.edu/camreport/cam89-04.pdf
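For readers unfamiliar with the distinction, the two regularizers can be written as filters on the eigenvalues $\lambda_i \ge 0$ of the matrix being inverted. This is the textbook form rather than anything stated in the thread; $\epsilon$ (damping) and $\tau$ (cutoff) are illustrative symbols.

```latex
\text{damping (Tikhonov-style):}\quad \lambda_i \mapsto \frac{1}{\lambda_i + \epsilon}
\qquad\qquad
\text{pseudo-inverse:}\quad \lambda_i \mapsto
\begin{cases} 1/\lambda_i, & \lambda_i > \tau \\ 0, & \lambda_i \le \tau \end{cases}
```

Damping shrinks every direction smoothly and leaves near-null directions with a large but finite gain of about $1/\epsilon$; the pseudo-inverse leaves directions above the cutoff untouched and zeroes the rest.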

rohan anil@_arohan_

First one is: don't take square roots of small eps. do pseudo inverse

inv_power_L = L.pow(-1.0 / root)

to:

positive_eigenvalue_mask = L > 1e-15
inv_power_L = torch.zeros_like(L)
inv_power_L[positive_eigenvalue_mask] = L[positive_eigenvalue_mask].pow(-1.0 / root)
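A self-contained version of this change, for experimenting outside the optimizer. It is a sketch in NumPy rather than the PyTorch of the original snippet, and it assumes `L` holds the eigenvalues returned by an `eigh` call; the function names and the example values are hypothetical.

```python
import numpy as np

def naive_inverse_root(eigvals, root):
    """Original form: raise every eigenvalue to -1/root, including tiny ones."""
    with np.errstate(divide="ignore"):
        return eigvals ** (-1.0 / root)

def masked_inverse_root(eigvals, root, threshold=1e-15):
    """Pseudo-inverse form: invert eigenvalues above the threshold, zero the rest."""
    mask = eigvals > threshold
    out = np.zeros_like(eigvals)
    out[mask] = eigvals[mask] ** (-1.0 / root)
    return out

eigvals = np.array([16.0, 1.0, 1e-20, 0.0])
naive = naive_inverse_root(eigvals, root=4)
masked = masked_inverse_root(eigvals, root=4)
print("naive: ", naive)
print("masked:", masked)
```

In the naive form the near-zero eigenvalue is amplified to roughly 1e5 and the exact zero becomes infinite; in the masked form both directions are simply dropped.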

Yaroslav Bulatov@yaroslavvb

I went to a few optimization talks at 2018 ICML and I'm also reminded of the GGT optimizer (same as Shampoo^2?). In my mind the biggest deviation of Muon is that it breaks with the "preconditioning" view. It's not preconditioning to normalize gradient by gradient in the same batch. The preconditioning direction is still open: how to best utilize correlation statistics from training history

Konstantin Mishchenko@konstmish

@kellerjordan0 @_arohan_ Muon and Shampoo are indeed related but definitely not to the extent to call them the same.

Keller Jordan@kellerjordan0

Citations:

Shampoo is Gupta et al. (2018) https://arxiv.org/abs/1802.09568 and Anil et al. (2020) https://arxiv.org/abs/2002.09018

Spectral descent is Carlson et al. (2015a) https://proceedings.mlr.press/v38/carlson15.html and Carlson et al. (2015b) https://papers.nips.cc/paper_files/paper/2015/hash/f50a6c02a3fc5a3a5d4d9391f05f3efc-Abstract.html 6/6

Keller Jordan@kellerjordan0

...equivalent. This is correct. The problem is that Muon without accumulation is not Muon: It is Spectral Descent, which is >2x slower.

To go fast we need accumulation, and -- as shown in the figure -- the way it's added is what makes the difference between Muon and Shampoo. 5/6

2d5.1K5819

Keller Jordan@kellerjordan0

I appreciate this result provided by @_arohan_, and look forward to fully understanding the algorithm used here once his reproducible log becomes available.

rohan anil@_arohan_

@kellerjordan0 Yes! I will send a patch. The comments should be good to recreate it if you want it asap. Only thing thats unspecified was lr and wd, but you can calculate that based on the original lr, and ratio of sqrt(size)/sqrt(row/col) thing. Just make sure you update weight decay.

Keller Jordan@kellerjordan0

@_arohan_ Thanks for this result. Can you please provide us with the reproducible logfile generated by the run? I would like to understand the algorithm you used in full detail

Keller Jordan@kellerjordan0

Reproducible logs:

Shampoo: https://github.com/KellerJordan/modded-nanogpt/blob/master/records/track_3_optimization/results/20260513_shampoo_1_4_power/503575c5-6dde-425a-b461-2df4d99db974.txt

Spectral descent: https://github.com/KellerJordan/modded-nanogpt/blob/master/records/track_3_optimization/results/20260517_ortho/d5098d67-7c1b-47b4-8833-80960d633d33.txt

As part of the public benchmark, further hyperparameter improvements are welcomed for any of these runs. All four use the same WSD lr schedule. 2/6
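WSD usually stands for warmup-stable-decay. The sketch below shows the general shape with a linear warmup and a linear decay; the function name, the phase lengths, and the peak learning rate are hypothetical and are not the values used in the benchmark, which are recorded in the logs linked above.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, final_lr=0.0):
    """Warmup-stable-decay: linear ramp up, flat plateau, linear ramp down."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    progress = (step - decay_start) / decay_steps
    return peak_lr + (final_lr - peak_lr) * min(progress, 1.0)

# Hypothetical 1000-step run: 100 warmup steps, 200 decay steps.
for s in (0, 99, 500, 900, 1000):
    print(s, wsd_lr(s, total_steps=1000, peak_lr=1.0, warmup_steps=100, decay_steps=200))
```

Using one schedule for all runs means that differences in step count come from the optimizer and its tuned hyperparameters rather than from the shape of the learning-rate curve.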

rohan anil@_arohan_

Second one is that you do Nesterov momentum from Ilya/Marten/Hinton’s paper on importance of momentum from 2011 and pass that into shampoo.
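As an illustration of that idea, the sketch below applies a Nesterov-style momentum step to the raw gradient and hands the result to a preconditioner. The formulation (buffer update followed by a look-ahead combination) is one common variant and is an assumption here, as are the function names and the stand-in `precondition` callable; the thread does not give the exact code.

```python
import numpy as np

def nesterov_gradient(grad, buf, mu=0.9):
    """Return (look-ahead gradient, updated momentum buffer)."""
    new_buf = mu * buf + grad          # accumulate momentum
    lookahead = grad + mu * new_buf    # Nesterov-style look-ahead
    return lookahead, new_buf

def momentum_then_precondition(grad, buf, precondition, mu=0.9):
    """Apply momentum first, then pass the result to the preconditioner."""
    lookahead, new_buf = nesterov_gradient(grad, buf, mu)
    return precondition(lookahead), new_buf

# Toy usage: an identity function stands in for the Shampoo preconditioner.
buf = np.zeros(2)
grad = np.ones(2)
for _ in range(2):
    update, buf = momentum_then_precondition(grad, buf, precondition=lambda g: g)
print(update, buf)
```

The point of the ordering is that the preconditioner sees the momentum-smoothed gradient rather than the raw one.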

neon@neonkeysf

I checked god’s plan for us and it involves Nesterov momentum

Keller Jordan@kellerjordan0

@_arohan_ Thanks for this result. Here is my initial comment while awaiting the reproducible logfiles of these runs

Keller Jordan@kellerjordan0

Thank you for this result; I appreciate it. Here's one initial correction:

In his post, Rohan states that I made an attempt to implement Shampoo, and labels my Shampoo training run `Buggy shampoo impl`.

But this is incorrect: I did not implement my own version of Shampoo.

Instead -- as can be seen in the reproducible log I provided in the original post -- I used, out-of-the-box, the official DistributedShampoo implementation provided by Facebook research - for which it so happens that @_arohan_ has contributing credits.

If there are indeed bugs in this official implementation, I can safely say that they were not created by me.

The remaining mystery, for anyone interested, might naturally become something like the following - How did Rohan today achieve a significantly better result compared to the official 2022-era DistributedShampoo implementation?

I am grateful for the comments he has already made regarding the deltas between his version and the 2022 one, and I am looking forward to things becoming fully precise/detailed soon once he releases the reproducible logfiles generated by his runs.

rohan anil@_arohan_

Last one is that with Adam grafting from Meta’s impl, the size of the update is O(sqrt(size)), for which you have to set a different lr and weight decay. The Muon implementation uses different lr / wd for various layers. I just used it, and rescaled it as appropriate.
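The scaling remark can be made concrete with a toy example. Grafting, as commonly described, keeps the direction of one update and the norm of another. On one reading of the post, an Adam-like update has entries close to ±1, so its Frobenius norm is sqrt(m·n), while an orthogonalized m×n update has Frobenius norm sqrt(min(m, n)). The helper name and the matrix shape below are hypothetical, and the sketch does not reproduce the exact rescaling used in the run.

```python
import numpy as np

def graft(direction, magnitude_source):
    """Rescale `direction` to have the Frobenius norm of `magnitude_source`."""
    scale = np.linalg.norm(magnitude_source) / np.linalg.norm(direction)
    return direction * scale

rng = np.random.default_rng(0)
m, n = 8, 4
G = rng.standard_normal((m, n))

adam_like = np.sign(G)                            # every entry has magnitude 1
U, _, Vt = np.linalg.svd(G, full_matrices=False)
orthogonal = U @ Vt                               # every singular value is 1

grafted = graft(orthogonal, adam_like)
print(np.linalg.norm(adam_like))    # sqrt(m * n)
print(np.linalg.norm(orthogonal))   # sqrt(min(m, n))
print(np.linalg.norm(grafted))      # same norm as the Adam-like update
```

Because the two norms differ by a shape-dependent factor, a learning rate and weight decay tuned for one update scale would need to be rescaled for the other.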

rohan anil@_arohan_

Third one is that for an [m, n] tensor, take the side that's larger; especially good because otherwise there is some numerical instability, since it's low rank, plus indefiniteness creeps in. Didn't spend too much time.

Will Held@WilliamBarrHeld

While most of the 👀 on this is for the drama, I think the back and forth is a case study in how a good shared setup is the best forcing function to push towards clarity for all!
