
CoreAutoAI co-founder Rohan Anil shows Meta's unmodified PyTorch Shampoo package achieves competitive NanoGPT speed-run performance

Story Overview

Rohan Anil demonstrated that Meta's stock DistributedShampoo implementation from facebookresearch/optimizers can reach the modded-nanogpt target validation loss of 3.28 on FineWeb without any source edits. The gains came from enabling the package's pseudo-inverse option for rank-deficient preconditioner matrices, plus a narrow set of hyperparameter values: lr=0.01, wd=0.1, beta2=0.9, eps=1e-15 and precondition_frequency=1.

Original post
rohan anil@_arohan_#86inTech

Hyper-parameter tuning works really well!

Since Keller asked very nicely - I took it as a challenge to find a minimal edit distance from his config to a working config with no further code modifications, which means vanilla Meta's distributed_shampoo PyTorch package as is which won the AlgoPerf competition.

Biggest alpha here was tuning hyper-parameters apart from enabling pseudo inverse which was critical as this speed run problem produces rank deficient matrices (see those nice viz), and thanks to great help from Anna!

Hypers: lr=0.01, wd=0.1, beta2=0.9, eps=1e-15, freq=1 🙃

No Nesterov momentum was used. Oops.

All that is no longer needed - those modifications just was giving us identical training curves for same step targets because that's how math works, now the curves look different, but arrive at these end of train validation targets.

- first completed segment: step:3375/3375 val_loss:3.27656
- second completed segment: step:3375/3375 val_loss:3.27675

[1] https://github.com/facebookresearch/optimizers/issues/265#issuecomment-4668270192

Next steps for anyone looking for a late night hobby:

* I would probably fuse lot of these things, make kernels
* Try it on new baselines, harder tasks: increase batch size or change architecture.
* Tune frequency, change eigh to ortho items etc.

Keller Jordan@kellerjordan0

Thank you for this result! Here's one initial correction:

In this post, Rohan states that I made an attempt to implement Shampoo, and labels my Shampoo training run `Buggy shampoo impl`.

But this is incorrect: I did not implement my own version of Shampoo.

Instead -- as can be seen in the reproducible log I provided in the original post -- I used, out-of-the-box, the official DistributedShampoo implementation provided by Facebook research. This is the most commonly-used Shampoo implementation that I could find on the internet.

The extent of my use of this implementation can be seen in the few lines of code below. If there are indeed bugs in this implementation, I can safely say that they were not created by me.

The remaining mystery, for anyone interested, might naturally become something like the following - How did Rohan today achieve a significantly better result compared to the official 2022-era DistributedShampoo implementation?

I am grateful for the comments he has already made regarding the deltas between his version and the 2022 one, and I am looking forward to things becoming fully precise/detailed soon once he releases the reproducible logfiles generated by his runs.

1:51 AM · Jun 10, 2026 · 33K Views
Developer Impact

Tuning revives an official package

The same package that won AlgoPerf now reaches the 3.28 target on this 124M model benchmark within the same step budget as the Muon baseline, once the rank-deficiency issue is handled by enabling the pseudo-inverse option (rank_deficient_stability_config=PseudoInverseConfig()).
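To see why rank deficiency matters here, the following is a minimal sketch of the idea only, not the DistributedShampoo code. It assumes the preconditioner applies an inverse fourth root to the eigenvalues of a positive semi-definite factor, as in the original Shampoo formulation for matrix parameters. The function name, the `rtol` cutoff and the example eigenvalues are hypothetical and chosen for illustration. With a tiny epsilon such as 1e-15, a zero eigenvalue is mapped to a very large scale, whereas a pseudo-inverse maps it to zero.

```python
import math


def inverse_root_eigenvalues(eigenvalues, root=4, eps=1e-15,
                             pseudo_inverse=False, rtol=1e-6):
    """Map each eigenvalue lam of a PSD factor to lam ** (-1 / root).

    Hypothetical helper for illustration only.
    - pseudo_inverse=False: add eps before taking the inverse root.
    - pseudo_inverse=True: eigenvalues at or below rtol * max(eigenvalues)
      are treated as null-space directions and mapped to 0.
    """
    top = max(eigenvalues)
    scales = []
    for lam in eigenvalues:
        if pseudo_inverse:
            keep = lam > rtol * top
            scales.append(lam ** (-1.0 / root) if keep else 0.0)
        else:
            scales.append((lam + eps) ** (-1.0 / root))
    return scales


# Made-up spectrum of a rank-deficient factor: one eigenvalue is exactly zero.
eigs = [4.0, 1.0, 0.0]

damped = inverse_root_eigenvalues(eigs)                          # eps-regularised
truncated = inverse_root_eigenvalues(eigs, pseudo_inverse=True)  # pseudo-inverse

print("eps-regularised scales:", damped)
print("pseudo-inverse scales: ", truncated)
```

In the eps-regularised case the null direction gets a scale of roughly (1e-15) ** (-1/4), which is in the thousands, so noise in that direction can dominate the update. The pseudo-inverse variant leaves that direction untouched.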

Open Question

Limits remain visible

Wall-clock times and performance on larger models or different batch sizes are not reported yet, and the runs still need more repetitions for statistical significance, leaving open whether the config travels beyond this specific speed-run setup.
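The step-count comparison in the thread comes down to finding the first logged step at which validation loss drops below 3.28. The sketch below assumes log lines shaped like the ones quoted in the posts (`step:3375/3375 val_loss:3.27656`) and that lines appear in step order. The helper name is hypothetical, and the first two sample lines are made-up values for illustration; only the last line is taken from the post.

```python
import re

# Matches lines such as "step:3375/3375 val_loss:3.27656".
LOG_LINE = re.compile(r"step:(\d+)/(\d+)\s+val_loss:(\d+\.\d+)")


def first_step_below(log_lines, target=3.28):
    """Return the first step whose val_loss is strictly below target.

    Hypothetical helper: assumes lines are in increasing step order.
    Returns None if the target is never reached.
    """
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and float(match.group(3)) < target:
            return int(match.group(1))
    return None


sample_log = [
    "step:3250/3375 val_loss:3.29010",  # illustrative value, not from a real run
    "step:3325/3375 val_loss:3.27950",  # illustrative value, not from a real run
    "step:3375/3375 val_loss:3.27656",  # quoted in the original post
]

print(first_step_below(sample_log))
```

A single run crossing the threshold is not enough on its own; the speed-run comparison also depends on repeating the run enough times to be statistically meaningful, as noted in the thread.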

Sentiment

Users are excited about minimal config tuning enabling DistributedShampoo to match Muon performance on NanoGPT because it highlights impressive results from small hyperparameter edits rather than code rewrites.

Pos
100.0%
Neg
0.0%
8 comments with sentiment.
Cluster Engagement
Posts from X
Keller Jordan@kellerjordan0

I have some mixed feelings about this result:

On the one hand, it's genuinely impressive. I didn't know that Shampoo could be configured to perform this well on the benchmark.

On the other hand, the way this performance boost was achieved seems difficult to call "Vanilla," for the following reason:

According to @_arohan_, the boost depends upon fixing a numerical linear algebra issue that he observed to occur in my initial standard DistributedShampoo run. He fixed the issue by enabling the flag rank_deficient_stability_config=PseudoInverseConfig().

Here's the problem: This is an undocumented flag. It is contained within the 12,000-line DistributedShampoo codebase, but it does not appear in any user-facing documentation.

As a result, if someone tries to train a model using DistributedShampoo without either (a) knowing about this special undocumented flag or (b) being prepared to detect and fix the numerical linear algebra issues that may occur without it, then they won't be able to achieve @_arohan_'s level of Shampoo performance. This level of effort would be considered atypical for mere hyperparameter tuning.

-- [Note on Muon baseline in plot below: Rohan's post compared Shampoo to a slightly undertuned Muon baseline from 2026/05/01, which reached the target loss in 3375 steps. This resulted in a 50-step gap between Shampoo and Muon. In the figure below I'm using the up-to-date 2026/05/03 baseline, which reaches the target in 3325 steps. This results in the step-counts exactly matching between Muon and the tuned/stabilized Shampoo variant.]

2hViews 19.9KLikes 127Bookmarks 33
rohan anil@_arohan_

.@kellerjordan0 posts results saying X is half way between. I educate him on optimization details, and hyper parameter tuning, and literally changes his code. I did not write any of this code.

Now he has mixed feelings and claims didn’t know Shampoo would work when he was in the same discord channel where they discussed Muon as fancy shampoo or practical shampoo in 2024. @HessianFree back me up?

2hViews 10.1KLikes 88Bookmarks 39
rohan anil@_arohan_

“Shampoo falls halfway between Muon & Adam” was the very confident claim.

This has been addressed. It had to be done.

Nesterov largely have not been useful in language models from my experience and now for that I got the answer as well. Not enough effort has gone to these submissions.

This is why we do tuning budgets and made as fair comparison as possible @zacharynado

https://arxiv.org/abs/2502.15015

3hViews 1.4KLikes 28Bookmarks 18
Evan Walters@evaninwords

This is a funny statement, the speedrun was built around muon for 1.5 years, Rohan played with shampoo for 1 day 😂

Not crazy hypers either, LR .01, betas .9, wd .1...

Keller Jordan@kellerjordan0

@_arohan_ Ok, well your comparison is between an undertuned Muon and a fully-tuned Shampoo.

I guess you used result #6 logs instead of #12? Not a massive deal, but I'll make the fair comparison when I do a post

6hViews 9.1KLikes 70Bookmarks 8
rohan anil@_arohan_

@kellerjordan0 You can probably rerun it with various targets now. I listed a bunch of next steps that will be interesting. I wont be doing more work on this, since I am a bit busy building a new company :)

Keller Jordan@kellerjordan0

@_arohan_ Thanks a lot for the logfile!

One error: You're using an obsolete Muon baseline. The current Muon baseline is 3325 steps, not 3375, it's result #12.

11hViews 3KLikes 40Bookmarks 4
Keller Jordan@kellerjordan0

@_arohan_ Ok, well your comparison is between an undertuned Muon and a fully-tuned Shampoo.

I guess you used result #6 logs instead of #12? Not a massive deal, but I'll make the fair comparison when I do a post

rohan anil@_arohan_

@kellerjordan0 I ran these with 3375 end steps. Because thats what I saw in from the logs I was using. I am okay with this.

10hViews 10.7KLikes 37Bookmarks 3
Keller Jordan@kellerjordan0

@_arohan_ Thanks a lot for the logfile!

One error: You're using an obsolete Muon baseline. The current Muon baseline is 3325 steps, not 3375, it's result #12.

11hViews 3.2KLikes 30Bookmarks 3
rohan anil@_arohan_

That was with Nesterov momentum, and other changes as I wrote down in the original tweet with a different step target since it will cool down learning rate as well.

I want to provide a proof that Meta’s distributed shampoo package is pretty darn good, and all the bells and whistles are no longer needed.

You just need to tune stuff bro.

elie@eliebakouch

@_arohan_ this is different changes from the curve in the initial tweet that looked better no?

10hViews 1.6KLikes 27Bookmarks 1
You Jiacheng@YouJiacheng

as always, well-tuned HPs matter!

8hViews 1.5KLikes 17Bookmarks 3
elie@eliebakouch

@_arohan_ ok interesting was not sure if this included the previous changes as well or not thanks!

btw how do you expect change of architecture to interact with the different optimizer?

rohan anil@_arohan_

That was with Nesterov momentum, and other changes as I wrote down in the original tweet with a different step target since it will cool down learning rate as well.

I want to provide a proof that Meta’s distributed shampoo package is pretty darn good, and all the bells and whistles are no longer needed.

You just need to tune stuff bro.

10hViews 648Likes 10Bookmarks 2
George Grigorev@iamgrigorev

@kellerjordan0 relax guys, both are great optimizers, I think it's valuable to get a new piece of information

2hViews 584Likes 21
Keller Jordan@kellerjordan0

@_arohan_ Sorry wdym? The main difference is a better-tuned learning rate compared to the old baseline.

Both your logfile above and the proper Muon baseline would qualify as 3325-step runs, since your logfile hits <3.28 at that time. Tho yours would need more stat sig, but that's trivial.

rohan anil@_arohan_

@kellerjordan0 You get a different horizon with learning rate decay?

10hViews 1.9KLikes 15Bookmarks 0
elie@eliebakouch

@_arohan_ this is different changes from the curve in the initial tweet that looked better no?

rohan anil@_arohan_

Here you go, sir. Muon is a good optimizer.

I think Keller's attempt at implementing it is great -- this primarily helped me look at why his effort produced much worse looking curve than what I could get.

This ended up being a nerdsnipe into hyper parameters, grafting, and eigh calls in the distributed shampoo package from Meta in my car ride home.

The main deltas are below:

11hViews 2.9KLikes 13Bookmarks 0
You Jiacheng@YouJiacheng

@_arohan_ @kellerjordan0 my belief is that shampoo with β2 should work at least as well as muon, cuz Muon's RootInv(EMA(G)EMA(G).T) is a non-variance-damped version of one-sided shampoo.🤔

24mViews 219Likes 4
Yacine Mahdid@yacinelearning

@_arohan_ this is super cool guys lovely to see the back and forth

7hViews 142Bookmarks 1
You Jiacheng@YouJiacheng

@_arohan_ @kellerjordan0 but tbf Keller said "perform this well" instead of "work", I guess his belief was that shampoo is better than Adam but maybe worse than Muon? (not unreasonable cuz sometimes variance-damped version might be worse).

21mViews 157Likes 3
Keller Jordan@kellerjordan0

@_arohan_ This is an interesting result. I'm just saying compare against the real baseline please! There's no rerunning necessary.

11hViews 78Likes 2
rohan anil@_arohan_

@kellerjordan0 I ran these with 3375 end steps. Because thats what I saw in from the logs I was using. I am okay with this.

10hViews 77Likes 1
rohan anil@_arohan_

@kellerjordan0 You get a different horizon with learning rate decay?

10hViews 56Likes 1
You Jiacheng@YouJiacheng

@kellerjordan0 what will happen if we skip (3) but still do (1)(2)(4)(5)?

47mViews 115Likes 2