
CoreAutoAI co-founder Rohan Anil shows Meta's unmodified PyTorch Shampoo package achieves competitive NanoGPT speed-run performance

Story Overview

Rohan Anil demonstrated that Meta's stock DistributedShampoo implementation from facebookresearch/optimizers can hit the modded-nanogpt target loss of 3.28 on FineWeb without any source edits. The gains came from swapping to a truncated pseudo-inverse for the preconditioner step plus a narrow set of hyperparameter values including lr=0.01, beta2=0.9, eps=1e-15 and precondition_frequency=1.
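Whether a run counts as reaching the target can be checked directly from its log. The sketch below is illustrative only: it assumes log lines carry the `step:N/M val_loss:X` fields exactly as quoted in the post, the helper name `first_step_below` is hypothetical, and a real modded-nanogpt logfile may carry additional fields on each line.

```python
import re

# Pattern for the two fields quoted in the post, e.g.
# "step:3375/3375 val_loss:3.27656". Other fields on the line are ignored.
LINE_RE = re.compile(r"step:(\d+)/(\d+)\s+val_loss:([0-9.]+)")


def first_step_below(log_text, target=3.28):
    """Return (step, val_loss) for the first logged loss under target, else None."""
    for match in LINE_RE.finditer(log_text):
        step = int(match.group(1))
        loss = float(match.group(3))
        if loss < target:
            return step, loss
    return None


# The two end-of-training lines reported in the post (one per segment).
sample_log = (
    "step:3375/3375 val_loss:3.27656\n"
    "step:3375/3375 val_loss:3.27675\n"
)

print(first_step_below(sample_log))
```

Run against a full logfile, the same helper would report the earliest step at which the validation loss crosses the target, which is the quantity the step-count comparison in the thread turns on.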

Original post
rohan anil@_arohan_

Hyper-parameter tuning works really well!

Since Keller asked very nicely - I took it as a challenge to find a minimal edit distance from his config to a working config with no further code modifications, which means vanilla Meta's distributed_shampoo PyTorch package as is which won the AlgoPerf competition.

Biggest alpha here was tuning hyper-parameters apart from enabling pseudo inverse which was critical as this speed run problem produces rank deficient matrices (see those nice viz), and thanks to great help from Anna!

Hypers: lr=0.01, wd=0.1, beta2=0.9, eps=1e-15, freq=1 🙃

No Nesterov momentum was used. Oops.

All that is no longer needed - those modifications just was giving us identical training curves for same step targets because that's how math works, now the curves look different, but arrive at these end of train validation targets.

- first completed segment: step:3375/3375 val_loss:3.27656
- second completed segment: step:3375/3375 val_loss:3.27675

[1] https://github.com/facebookresearch/optimizers/issues/265#issuecomment-4668270192

Next steps for anyone looking for a late night hobby:

* I would probably fuse lot of these things, make kernels
* Try it on new baselines, harder tasks: increase batch size or change architecture.
* Tune frequency, change eigh to ortho items etc.

Keller Jordan@kellerjordan0

Thank you for this result! Here's one initial correction:

In this post, Rohan states that I made an attempt to implement Shampoo, and labels my Shampoo training run `Buggy shampoo impl`.

But this is incorrect: I did not implement my own version of Shampoo.

Instead -- as can be seen in the reproducible log I provided in the original post -- I used, out-of-the-box, the official DistributedShampoo implementation provided by Facebook research. This is the most commonly-used Shampoo implementation that I could find on the internet.

The extent of my use of this implementation can be seen in the few lines of code below. If there are indeed bugs in this implementation, I can safely say that they were not created by me.

The remaining mystery, for anyone interested, might naturally become something like the following - How did Rohan today achieve a significantly better result compared to the official 2022-era DistributedShampoo implementation?

I am grateful for the comments he has already made regarding the deltas between his version and the 2022 one, and I am looking forward to things becoming fully precise/detailed soon once he releases the reproducible logfiles generated by his runs.

1:51 AM · Jun 10, 2026 · 3.2K Views
Developer Impact

Tuning revives an official package

The same package that once won AlgoPerf now matches Muon-level curves on this 124M model benchmark once the rank-deficiency issue is handled through the eigenvalue truncation trick.
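To make the eigenvalue truncation idea concrete, here is a minimal NumPy sketch of a truncated pseudo-inverse root for a rank-deficient preconditioner factor. It is not the code from Meta's package: the function name `truncated_inverse_root`, the relative tolerance `rel_tol`, and the choice of the inverse fourth root are all assumptions made for illustration.

```python
import numpy as np


def truncated_inverse_root(mat, root=4, rel_tol=1e-6):
    """Return mat^(-1/root) restricted to the numerically non-null eigenspace.

    mat is assumed symmetric positive semi-definite. Eigenvalues at or below
    rel_tol * (largest eigenvalue) are treated as zero, and their directions
    are dropped rather than inverted. rel_tol is an illustrative choice.
    """
    eigvals, eigvecs = np.linalg.eigh(mat)
    keep = eigvals > rel_tol * eigvals.max()
    inv_root = np.zeros_like(eigvals)
    inv_root[keep] = eigvals[keep] ** (-1.0 / root)
    # Scale each eigenvector column, then recombine: V diag(inv_root) V^T.
    return (eigvecs * inv_root) @ eigvecs.T


rng = np.random.default_rng(0)
# A rank-2 "gradient" for an 8x8 parameter, so grad @ grad.T is rank deficient.
grad = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 8))
factor = grad @ grad.T

precond = truncated_inverse_root(factor)
print(np.linalg.matrix_rank(precond))
```

The alternative of adding a tiny epsilon to every eigenvalue before inverting would assign the near-null directions a very large inverse root; truncation zeroes those directions instead, which is one plausible reading of why the pseudo-inverse matters when the matrices are rank deficient.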

Open Question

Limits remain visible

The post reports end-of-training loss at step 3375, but wall-clock times and performance on larger models or different batch sizes are not reported yet, leaving open whether the config travels beyond this specific speed-run setup.

Sentiment

Commenters are excited that Rohan Anil matched Muon's performance by tuning only a few lines of Shampoo config; the minimal-edit approach reads to them as a hacker-style win.

Positive: 100.0% · Negative: 0.0% (5 comments with sentiment)
Cluster Engagement
Posts from X
elie@eliebakouch

@_arohan_ this is different changes from the curve in the initial tweet that looked better no?

rohan anil@_arohan_

Here you go, sir. Muon is a good optimizer.

I think Keller's attempt at implementing it is great -- this primarily helped me look at why his effort produced much worse looking curve than what I could get.

This ended up being a nerdsnipe into hyper parameters, grafting, and eigh calls in the distributed shampoo package from Meta in my car ride home.

The main delta's are below:

45m · 715 Views · 5 Likes · 0 Bookmarks
Keller Jordan@kellerjordan0

@_arohan_ Thanks a lot for the logfile!

One error: You're using an obsolete Muon baseline. The current Muon baseline is 3325 steps, not 3375, it's result #12.

48m · 518 Views · 11 Likes · 0 Bookmarks
rohan anil@_arohan_

That was with Nesterov momentum, and other changes as I wrote down in the original tweet with a different step target since it will cool down learning rate as well.

I want to provide a proof that Meta’s distributed shampoo package is pretty darn good, and all the bells and whistles are no longer needed.

You just need to tune stuff bro.

42m · 265 Views · 9 Likes · 0 Bookmarks
rohan anil@_arohan_

@kellerjordan0 You can probably rerun it with various targets now. I listed a bunch of next steps that will be interesting. I wont be doing more work on this, since I am a bit busy building a new company :)

46m · 441 Views · 10 Likes · 0 Bookmarks
Keller Jordan@kellerjordan0

@_arohan_ Ok, well your comparison is between an undertuned Muon and a fully-tuned Shampoo.

I guess you used result #6 logs instead of #12? Not a massive deal, but I'll make the fair comparison when I do a post

rohan anil@_arohan_

@kellerjordan0 I ran these with 3375 end steps. Because thats what I saw in from the logs I was using. I am okay with this.

27m · 147 Views · 7 Likes · 0 Bookmarks
Keller Jordan@kellerjordan0

@_arohan_ Sorry wdym? The main difference is a better-tuned learning rate compared to the old baseline.

Both your logfile above and the proper Muon baseline would qualify as 3325-step runs, since your logfile hits <3.28 at that time. Tho yours would need more stat sig, but that's trivial.

rohan anil@_arohan_

@kellerjordan0 You get a different horizon with learning rate decay?

37m · 196 Views · 2 Likes · 0 Bookmarks
elie@eliebakouch

@_arohan_ ok interesting was not sure if this included the previous changes as well or not thanks!

btw how do you expect change of architecture to interact with the different optimizer?

38m · 129 Views · 2 Likes · 0 Bookmarks
Keller Jordan@kellerjordan0

@_arohan_ This is an interesting result. I'm just saying compare against the real baseline please! There's no rerunning necessary.

45m · 78 Views · 2 Likes
rohan anil@_arohan_

@eliebakouch Too much alpha given already. Researcher reciprocity license, so lets chat on DM

37m · 18 Views
Alex YGift@Radipdegen

@_arohan_ nice, the minimal edit distance flex is actually the most hacker way to win a disagreement

50m
Rugbist@rugbist_

@_arohan_ respect for taking the challenge seriously

the minimal edit distance approach says more than any full rewrite could

56m
Invincible@InvincibleEdge

@_arohan_ Wait so the difference between buggy and working was just a few lines of config tuning not even code changes.

Thats actually wild

57m
Blissy@BlissyOnX

@_arohan_ interesting how muon doesnt have the same distributed implementation issues as shampoo

57m