Tech · 3h ago

Meta's Konstantin Mishchenko optimizes the modded-nanogpt baseline, warning that weak baselines make new optimizers look deceptively promising

Story Overview

Meta AI researcher Konstantin Mishchenko retuned the Muon baseline on the modded-nanogpt benchmark, cutting the steps needed to reach the FineWeb validation target loss of 3.28 from 3,325 to 3,250. His note stresses that poorly tuned baselines can inflate the apparent gains of newer optimizers. It follows community discussion of an undocumented DistributedShampoo flag that had boosted earlier Shampoo numbers.

Original post
Konstantin Mishchenko @konstmish (#1792 in Tech)

I just submitted a PR to modded-nanogpt with better hyperparams. With them, Muon can reach the target loss after 3250 steps instead of 3325. Always tune your baseline well when doing research. Weak baselines can make any idea look promising

Keller Jordan @kellerjordan0

I have some mixed feelings about this result:

On the one hand, it's genuinely impressive. I didn't know that Shampoo could be configured to perform this well on the benchmark.

On the other hand, the way this performance boost was achieved seems difficult to call "Vanilla," for the following reason:

According to @_arohan_, the boost depends upon fixing a numerical linear algebra issue that he observed to occur in my initial standard DistributedShampoo run. He fixed the issue by enabling the flag rank_deficient_stability_config=PseudoInverseConfig().

Here's the problem: This is an undocumented flag. It is contained within the 12,000-line DistributedShampoo codebase, but it does not appear in any user-facing documentation.

As a result, if someone tries to train a model using DistributedShampoo without either (a) knowing about this special undocumented flag or (b) being prepared to detect and fix the numerical linear algebra issues that may occur without it, then they won't be able to achieve @_arohan_'s level of Shampoo performance. This level of effort would be considered atypical for mere hyperparameter tuning.

-- [Note on Muon baseline in plot below: Rohan's post compared Shampoo to a slightly undertuned Muon baseline from 2026/05/01, which reached the target loss in 3375 steps. This resulted in a 50-step gap between Shampoo and Muon. In the figure below I'm using the up-to-date 2026/05/03 baseline, which reaches the target in 3325 steps. This results in the step-counts exactly matching between Muon and the tuned/stabilized Shampoo variant.]

7:15 AM · Jun 11, 2026 · 7.3K Views
Benchmark Watch

Tighter references change how optimizer results read

With the Muon baseline now at 3,250 steps, any optimizer claiming gains must clear a higher bar on this public 124M-parameter speedrun track. The move directly narrows the window in which an undertuned baseline can make incremental hyperparameter tweaks look like algorithmic wins.
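To make the metric concrete, here is a minimal Python sketch of how "steps to target" and the relative saving could be computed. It assumes a single run with periodic validation logging and takes the first logged step at or below the target; the benchmark itself may define the measurement differently. The helper names and the toy log values are hypothetical; only the 3.28 target and the 3,325 and 3,250 step counts come from the posts.

```python
from typing import Optional, Sequence, Tuple

TARGET_LOSS = 3.28  # FineWeb validation target quoted in the story


def steps_to_target(val_log: Sequence[Tuple[int, float]],
                    target: float = TARGET_LOSS) -> Optional[int]:
    """Return the first logged step whose validation loss is at or below target."""
    for step, loss in val_log:
        if loss <= target:
            return step
    return None  # the run never reached the target


def relative_step_saving(old_steps: int, new_steps: int) -> float:
    """Fraction of steps saved when moving from old_steps to new_steps."""
    return (old_steps - new_steps) / old_steps


# Hypothetical (step, validation loss) log; the values are made up for illustration.
toy_log = [(3000, 3.31), (3125, 3.295), (3250, 3.279), (3375, 3.27)]

print(steps_to_target(toy_log))                    # 3250
print(f"{relative_step_saving(3325, 3250):.2%}")   # 2.26%
```

On the reported numbers, the retuned baseline saves 75 of 3,325 steps, or about 2.3%, so a new optimizer whose claimed gain is of that size would be indistinguishable from better baseline tuning.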

Open Question

Hidden config choices still cloud fair comparisons

Recent Shampoo runs relied on a stability flag buried deep in the codebase that was not part of routine tuning sweeps. Mishchenko’s baseline update leaves open how much those earlier numbers truly reflected vanilla optimizer performance versus one-off engineering fixes.

Sentiment

Replies are mixed. One user calls the displayed baselines unbelievable because no run was checked for convergence, and others ask how much tuning compute the improvement required. Keller Jordan questions whether Shampoo results that depend on an undocumented flag can be called "Vanilla."

Pos 62.5% · Neg 37.5% (5 comments with sentiment)
Cluster Engagement
Posts from X

I have a silly question. How far off is vanilla SGD with well tuned learning rate schedule and batch size? I'd love to see wall clock on the x axis

2h · Views 9.1K · Likes 38 · Bookmarks 9 · Retweets 1
Zachary Nado @zacharynado

@konstmish really if we are doing this fairly, each algo should have the same tuning budget, defined by number of hparam points tried.

it implicitly punishes algos with more hparams to tune, but imo that's a fair punishment because in reality practitioners only have a limited tuning budget

49m · Views 565 · Likes 7 · Bookmarks 0
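The equal-budget idea in the post above can be sketched as a random search that gives every optimizer the same number of hyperparameter points. This is only an illustration under stated assumptions: `tune`, `sample_config`, the two toy objectives, and the search ranges are all hypothetical stand-ins, not code from modded-nanogpt or AlgoPerf, and the toy scores are not real benchmark results.

```python
import math
import random
from typing import Callable, Dict, Tuple

Config = Dict[str, float]
SearchSpace = Dict[str, Tuple[float, float]]  # name -> (low, high)


def sample_config(space: SearchSpace, rng: random.Random) -> Config:
    """Draw one hyperparameter point, log-uniformly within each (low, high) range."""
    return {
        name: math.exp(rng.uniform(math.log(low), math.log(high)))
        for name, (low, high) in space.items()
    }


def tune(evaluate: Callable[[Config], float], space: SearchSpace,
         budget: int, seed: int = 0) -> Tuple[float, Config]:
    """Random search limited to `budget` evaluated points; lower score is better."""
    rng = random.Random(seed)
    best_score, best_config = math.inf, {}
    for _ in range(budget):
        config = sample_config(space, rng)
        score = evaluate(config)
        if score < best_score:
            best_score, best_config = score, config
    return best_score, best_config


# Hypothetical objectives standing in for "steps to reach the target loss".
def toy_one_knob(config: Config) -> float:
    return 3300.0 + 100.0 * abs(math.log10(config["lr"]) + 2.0)


def toy_three_knobs(config: Config) -> float:
    miss = (abs(math.log10(config["lr"]) + 2.0)
            + abs(math.log10(config["eps"]) + 8.0)
            + 10.0 * abs(config["beta"] - 0.95))
    return 3250.0 + 100.0 * miss


SPACE_A: SearchSpace = {"lr": (1e-4, 1.0)}
SPACE_B: SearchSpace = {"lr": (1e-4, 1.0), "eps": (1e-12, 1e-4), "beta": (0.8, 0.999)}

BUDGET = 16  # same number of hyperparameter points for both optimizers
for name, objective, space in [("one knob", toy_one_knob, SPACE_A),
                               ("three knobs", toy_three_knobs, SPACE_B)]:
    score, config = tune(objective, space, BUDGET)
    print(name, round(score), config)
```

With a fixed budget, the optimizer with more knobs has to cover a larger search space with the same number of points, which is the implicit penalty the post describes.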

wish my runpod credits luck guys

2h · Views 5.5K · Likes 15 · Bookmarks 1

With or without momentum? Back in ViT days i trained a large ViT with plain SGDM to within .5% accuracy of the AdamW one. Without momentum i didn't manage to get as close, maybe 4% or so behind iirc. Batch size had to be large indeed. I would expect qualitatively similar thing here.

1h · Views 1.4K · Likes 15 · Bookmarks 0
Keller Jordan @kellerjordan0

@konstmish Very nice. Would you happen to know how much of the improvement was from tuning the Muon hparams? As opposed to AdamW

1h · Views 738 · Likes 6 · Bookmarks 2
Alex Chaloner @alex_chaloner

@konstmish I think it's reasonable to judge optimizers by how sensitive they are to hyperparameters, although i think muon is generally good in this respect

How much did you have to search to find the better hyperparams?

2h · Views 239 · Likes 1 · Bookmarks 1
Volkan Cevher @CevherLIONS

So, what new insight did we gain from this exercise @konstmish other than spend more compute to tune hyperparameters?

I must admit, I have been trying myself with help from others but I think I will give up since it may take me next year to get a good configuration with my compute.

15m · Views 58 · Likes 3

@DimitrisPapail SGD tends to be quite suboptimal with transformers, it struggles even with minimizing the train loss.

17m · Views 74 · Likes 1 · Bookmarks 0

@CevherLIONS There are hundreds of papers claiming they propose optimizers better than AdamW, where the improvement comes from undertuned baseline. All I want is for this not to be the case here. But I'm also a bigger believer in AlgoPerf self-tuning track than modded nanogpt.

10m · Views 34 · Likes 1
Anirbit @anirbit_maths

@konstmish Did you run the code long enough to check for convergence? None of these displayed baselines are believable because this basic sanity check is not done 🤷‍♂️

2h · Views 201 · Likes 2

@zacharynado Yes, but then it has to be a lot more than a single problem. Different scales, domains, architectures. Like AlgoPerf.

21m · Views 60 · Likes 2

@DimitrisPapail I think that will be hard, maybe with really big batch size. But then, i think since they count steps, batch size must be fixed for the speedrun?

1h · Views 84 · Likes 1

gonna use codex gpt 5.5 xhigh on this and 8xh100. If you don't hear back in 5 hours, it means I bailed.

2h · Views 127

@giffmana im comparing wallclock, steps is weird with SGD since it uses less flops and memory per step

41m · Views 91

@giffmana @DimitrisPapail Interesting... Was it because the batch size was huge?

39m · Views 27
Ferbin @Ferbin08

@DimitrisPapail curious if you saw similar gaps when training on real robot footage. robotics datasets have way more edge cases than typical vision benchmarks. did batch size help there too, or did momentum become even more critical with the noisier data?

58m · Views 24

@anirbit_maths It's very far from convergence, which is my biggest complaint about modded nanogpt in general.

9m · Views 10