/AI59m ago

CoreAutoAI co-founder Rohan Anil shows Meta's unmodified PyTorch Shampoo package achieves competitive NanoGPT speed-run performance

Story Overview

Rohan Anil demonstrated that Meta's stock DistributedShampoo implementation from facebookresearch/optimizers can hit the modded-nanogpt target loss of 3.28 on FineWeb without any source edits. The gains came from swapping to a truncated pseudo-inverse for the preconditioner step plus a narrow set of hyperparameter values including lr=0.01, beta2=0.9, eps=1e-15 and precondition_frequency=1.
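As a rough illustration of how small the reported configuration is, the values from the post can be collected in one place. This is a sketch only: the `ShampooHypers` class and the `to_optimizer_kwargs` helper are hypothetical, the keyword names are assumed to mirror the DistributedShampoo constructor and should be checked against the installed version, and beta1 is not stated in the post, so the caller has to supply it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShampooHypers:
    """Hyperparameter values quoted in the post (hypothetical container)."""
    lr: float = 0.01
    weight_decay: float = 0.1
    beta2: float = 0.9
    epsilon: float = 1e-15
    precondition_frequency: int = 1
    use_nesterov: bool = False  # the post says no Nesterov momentum was used


def to_optimizer_kwargs(hypers: ShampooHypers, beta1: float) -> dict:
    """Map the reported values to constructor-style keyword arguments.

    The keyword names are assumptions, not a verified API. beta1 is not
    given in the post, so it is a required argument rather than a default.
    """
    return {
        "lr": hypers.lr,
        "betas": (beta1, hypers.beta2),
        "epsilon": hypers.epsilon,
        "weight_decay": hypers.weight_decay,
        "precondition_frequency": hypers.precondition_frequency,
        "use_nesterov": hypers.use_nesterov,
    }


if __name__ == "__main__":
    # beta1=0.95 is a placeholder for illustration, not a value from the thread.
    print(to_optimizer_kwargs(ShampooHypers(), beta1=0.95))
```

The pseudo-inverse switch is left out of the sketch on purpose, because the post does not name the option that enables it.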

Original post

rohan anil@_arohan_

Hyper-parameter tuning works really well!

Since Keller asked very nicely, I took it as a challenge to find the minimal edit distance from his config to a working config with no further code modifications. That means Meta's vanilla distributed_shampoo PyTorch package as is, the one that won the AlgoPerf competition.

The biggest alpha here was tuning hyper-parameters, apart from enabling the pseudo-inverse, which was critical because this speed-run problem produces rank-deficient matrices (see those nice viz). Thanks to Anna for the great help!

Hypers: lr=0.01, wd=0.1, beta2=0.9, eps=1e-15, freq=1 🙃

No Nesterov momentum was used. Oops.

All that is no longer needed: those modifications were just giving us identical training curves for the same step targets, because that's how math works. Now the curves look different, but arrive at these end-of-train validation targets.

- first completed segment: step:3375/3375 val_loss:3.27656
- second completed segment: step:3375/3375 val_loss:3.27675

[1] https://github.com/facebookresearch/optimizers/issues/265#issuecomment-4668270192

Next steps for anyone looking for a late night hobby:

* I would probably fuse a lot of these things, make kernels
* Try it on new baselines, harder tasks: increase batch size or change architecture.
* Tune frequency, change eigh to ortho items etc.

Keller Jordan@kellerjordan0

Thank you for this result! Here's one initial correction:

In this post, Rohan states that I made an attempt to implement Shampoo, and labels my Shampoo training run `Buggy shampoo impl`.

But this is incorrect: I did not implement my own version of Shampoo.

Instead -- as can be seen in the reproducible log I provided in the original post -- I used, out-of-the-box, the official DistributedShampoo implementation provided by Facebook research. This is the most commonly-used Shampoo implementation that I could find on the internet.

The extent of my use of this implementation can be seen in the few lines of code below. If there are indeed bugs in this implementation, I can safely say that they were not created by me.

The remaining mystery, for anyone interested, might naturally become something like the following - How did Rohan today achieve a significantly better result compared to the official 2022-era DistributedShampoo implementation?

I am grateful for the comments he has already made regarding the deltas between his version and the 2022 one, and I am looking forward to things becoming fully precise/detailed soon once he releases the reproducible logfiles generated by his runs.

1:51 AM · Jun 10, 2026 · 3.2K Views

Developer Impact

Tuning revives an official package

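The same package that won AlgoPerf now reaches the speed-run validation target on this 124M-model benchmark, in the same number of steps as the Muon baseline Rohan compared against, once the rank-deficiency issue is handled through the eigenvalue truncation trick. The training curves themselves look different from the earlier runs; only the end-of-training targets line up.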
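To make the rank-deficiency point concrete, here is a minimal NumPy sketch of the general idea, not the package's actual code. It compares an epsilon-regularized inverse root with a truncated pseudo-inverse root on a deliberately rank-deficient statistics matrix. The function names, the relative cutoff `rel_tol`, and the choice of a 4th root (the usual exponent for 2-D parameters in Shampoo) are assumptions made for illustration.

```python
import numpy as np


def inverse_root_eps(mat, root, eps):
    """Inverse root with epsilon regularization: (mat + eps * I) ** (-1 / root)."""
    evals, evecs = np.linalg.eigh(mat)
    # Clamp tiny negative round-off before adding eps.
    evals = np.maximum(evals, 0.0) + eps
    return (evecs * evals ** (-1.0 / root)) @ evecs.T


def inverse_root_pinv(mat, root, rel_tol=1e-6):
    """Truncated pseudo-inverse root: invert only eigenvalues above a cutoff."""
    evals, evecs = np.linalg.eigh(mat)
    keep = evals > rel_tol * evals.max()
    inv = np.zeros_like(evals)
    inv[keep] = evals[keep] ** (-1.0 / root)  # dropped directions stay at zero
    return (evecs * inv) @ evecs.T


rng = np.random.default_rng(0)
# Rank-2 statistics in an 8-dimensional space: six eigenvalues are (near) zero.
factor = rng.standard_normal((8, 2))
stat = factor @ factor.T

p_eps = inverse_root_eps(stat, root=4, eps=1e-15)
p_pinv = inverse_root_pinv(stat, root=4)

# With a tiny eps, the null directions are scaled by roughly eps ** (-1 / 4),
# so the spectral norm is large; the truncated version stays modest.
print("eps-regularized spectral norm:", np.linalg.norm(p_eps, 2))
print("truncated pinv spectral norm: ", np.linalg.norm(p_pinv, 2))
print("rank of truncated root:       ", np.linalg.matrix_rank(p_pinv))
```

In this toy case, the epsilon-regularized root amplifies the directions that carry no gradient statistics by several thousand times, while the truncated version simply leaves them out. That is one plausible reading of why the pseudo-inverse matters when eps is as small as 1e-15.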

Open Question

Limits remain visible

Wall-clock times and results on larger models or different batch sizes are not reported yet, and the runs used a 3375-step target while Keller Jordan notes that the current Muon baseline is 3325 steps. That leaves open whether the config travels beyond this specific speed-run setup.
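The thread's back-and-forth over 3375-step versus 3325-step runs turns on how the learning-rate cooldown depends on the total step count. The sketch below shows that dependence under assumed settings: a constant phase followed by a linear cooldown to zero, with a hypothetical `cooldown_frac` of 0.4. Neither the schedule shape nor the fraction is stated in the thread; only the base learning rate of 0.01 and the two step targets come from it.

```python
def lr_at(step, total_steps, base_lr=0.01, cooldown_frac=0.4):
    """Constant learning rate, then a linear cooldown to zero.

    The cooldown covers the last `cooldown_frac` of training, so changing
    `total_steps` shifts the whole decay window (assumed schedule shape).
    """
    cooldown_start = total_steps * (1.0 - cooldown_frac)
    if step <= cooldown_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - cooldown_start)


if __name__ == "__main__":
    for total in (3325, 3375):
        # Same step index, different horizon, different learning rate.
        print(total, lr_at(3000, total), lr_at(3325, total))
```

Under these assumptions, a run planned for 3375 steps still has a non-zero learning rate at step 3325, while a run planned for 3325 steps has already decayed to zero there. So the loss at step 3325 is not directly comparable between the two horizons.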

Sentiment

Commenters are excited that Rohan Anil reached the speed-run target by tuning a handful of Shampoo hyperparameters, and several call the minimal-edit approach an impressively hacker-style win.

Pos: 100.0%

Neg: 0.0%

5 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

elie@eliebakouch

@_arohan_ this is different changes from the curve in the initial tweet that looked better no?

rohan anil@_arohan_

Here you go, sir. Muon is a good optimizer.

I think Keller's attempt at implementing it is great -- this primarily helped me look at why his effort produced much worse looking curve than what I could get.

This ended up being a nerdsnipe into hyper parameters, grafting, and eigh calls in the distributed shampoo package from Meta in my car ride home.

The main deltas are below:

45m

Keller Jordan@kellerjordan0

@_arohan_ Thanks a lot for the logfile!

One error: You're using an obsolete Muon baseline. The current Muon baseline is 3325 steps, not 3375, it's result #12.

48m

rohan anil@_arohan_

That was with Nesterov momentum, and other changes as I wrote down in the original tweet, with a different step target since it will cool down the learning rate as well.

I want to provide a proof that Meta’s distributed shampoo package is pretty darn good, and all the bells and whistles are no longer needed.

You just need to tune stuff bro.

42m

rohan anil@_arohan_

@kellerjordan0 You can probably rerun it with various targets now. I listed a bunch of next steps that will be interesting. I won't be doing more work on this, since I am a bit busy building a new company :)

46m

Keller Jordan@kellerjordan0

@_arohan_ Ok, well your comparison is between an undertuned Muon and a fully-tuned Shampoo.

I guess you used result #6 logs instead of #12? Not a massive deal, but I'll make the fair comparison when I do a post

rohan anil@_arohan_

@kellerjordan0 I ran these with 3375 end steps, because that's what I saw in the logs I was using. I am okay with this.

27m

Keller Jordan@kellerjordan0

@_arohan_ Sorry wdym? The main difference is a better-tuned learning rate compared to the old baseline.

Both your logfile above and the proper Muon baseline would qualify as 3325-step runs, since your logfile hits <3.28 at that time. Tho yours would need more stat sig, but that's trivial.

rohan anil@_arohan_

@kellerjordan0 You get a different horizon with learning rate decay?

37m

elie@eliebakouch

@_arohan_ ok interesting was not sure if this included the previous changes as well or not thanks!

btw how do you expect change of architecture to interact with the different optimizer?

38m

Keller Jordan@kellerjordan0

@_arohan_ This is an interesting result. I'm just saying compare against the real baseline please! There's no rerunning necessary.

45m

rohan anil@_arohan_

@eliebakouch Too much alpha given already. Researcher reciprocity license, so lets chat on DM

37m

Justine@ShillerP77755

@_arohan_ Kindly check out runn3rai. The beat ai project. One platform for everything @runn3r0101

18m

Alex YGift@Radipdegen

@_arohan_ nice, the minimal edit distance flex is actually the most hacker way to win a disagreement

50m

Rugbist@rugbist_

@_arohan_ respect for taking the challenge seriously

the minimal edit distance approach says more than any full rewrite could

56m

Invincible@InvincibleEdge

@_arohan_ Wait so the difference between buggy and working was just a few lines of config tuning not even code changes.

Thats actually wild

57m

Blissy@BlissyOnX

@_arohan_ interesting how muon doesnt have the same distributed implementation issues as shampoo

57m