Tech · 3h ago

Meta's Konstantin Mishchenko optimizes the modded-nanogpt baseline, warning that weak baselines make new optimizers look deceptively promising

Story Overview

Meta AI researcher Konstantin Mishchenko tightened the Muon baseline on the modded-nanogpt benchmark, cutting the steps needed to reach the FineWeb validation target loss of 3.28 from 3,325 to 3,250. His note stresses that poorly tuned references can inflate the apparent gains of newer optimizers. It follows community discussion of an undocumented DistributedShampoo flag that had boosted earlier Shampoo numbers.


Original post

Konstantin Mishchenko@konstmish · #1792 in Tech

I just submitted a PR to modded-nanogpt with better hyperparams. With them, Muon can reach the target loss after 3250 steps instead of 3325. Always tune your baseline well when doing research. Weak baselines can make any idea look promising

Keller Jordan@kellerjordan0

I have some mixed feelings about this result:

On the one hand, it's genuinely impressive. I didn't know that Shampoo could be configured to perform this well on the benchmark.

On the other hand, the way this performance boost was achieved seems difficult to call "Vanilla," for the following reason:

According to @_arohan_, the boost depends upon fixing a numerical linear algebra issue that he observed to occur in my initial standard DistributedShampoo run. He fixed the issue by enabling the flag rank_deficient_stability_config=PseudoInverseConfig().

Here's the problem: This is an undocumented flag. It is contained within the 12,000-line DistributedShampoo codebase, but it does not appear in any user-facing documentation.

As a result, if someone tries to train a model using DistributedShampoo without either (a) knowing about this special undocumented flag or (b) being prepared to detect and fix the numerical linear algebra issues that may occur without it, then they won't be able to achieve @_arohan_'s level of Shampoo performance. This level of effort would be considered atypical for mere hyperparameter tuning.

-- [Note on Muon baseline in plot below: Rohan's post compared Shampoo to a slightly undertuned Muon baseline from 2026/05/01, which reached the target loss in 3375 steps. This resulted in a 50-step gap between Shampoo and Muon. In the figure below I'm using the up-to-date 2026/05/03 baseline, which reaches the target in 3325 steps. This results in the step-counts exactly matching between Muon and the tuned/stabilized Shampoo variant.]

7:15 AM · Jun 11, 2026 · 7.3K Views

Benchmark Watch

Tighter references change how optimizer results read

With the Muon baseline now at 3,250 steps, down from 3,325, any optimizer claiming gains must clear a higher bar on this public 124M-parameter speedrun track. The move directly narrows the window in which incremental hyperparameter tweaks can masquerade as algorithmic wins.
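To make the arithmetic concrete, here is a minimal sketch of how the same result reads against each of the three Muon baselines quoted in the thread (3,375, 3,325 and 3,250 steps). The candidate step count of 3,300 is purely hypothetical and does not come from any post; the function and variable names are also illustrative.

```python
def relative_gain(baseline_steps: int, candidate_steps: int) -> float:
    """Fraction of steps saved relative to the baseline (negative means slower)."""
    return (baseline_steps - candidate_steps) / baseline_steps


# Muon step counts quoted in the thread.
MUON_BASELINES = {
    "2026/05/01 (undertuned)": 3375,
    "2026/05/03": 3325,
    "retuned (Mishchenko PR)": 3250,
}

# Hypothetical candidate optimizer, NOT a number from the source.
HYPOTHETICAL_CANDIDATE_STEPS = 3300


def gains_table(candidate_steps: int) -> dict:
    """Relative gain of one candidate against every quoted baseline."""
    return {
        name: relative_gain(steps, candidate_steps)
        for name, steps in MUON_BASELINES.items()
    }


if __name__ == "__main__":
    for name, gain in gains_table(HYPOTHETICAL_CANDIDATE_STEPS).items():
        print(f"{name:>26}: {gain:+.2%}")
```

Under these assumptions the hypothetical candidate looks like a clear improvement over the oldest baseline, a marginal one over the next, and a regression against the retuned one, even though its own step count never changed.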

Open Question

Hidden config choices still cloud fair comparisons

Recent Shampoo runs relied on a stability flag buried deep in the codebase that was not part of routine tuning sweeps. Mishchenko’s baseline update leaves open how much those earlier numbers truly reflected vanilla optimizer performance versus one-off engineering fixes.
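For readers wondering why a rank-deficiency setting could matter at all, the NumPy sketch below illustrates the general numerical issue. Shampoo-style preconditioners take inverse roots of accumulated gradient statistics, and those matrices can be rank deficient. This is not DistributedShampoo's code, and it makes no claim about what PseudoInverseConfig does internally; the function names, the epsilon, and the cutoff are all assumptions chosen for illustration.

```python
import numpy as np


def inverse_root_eps(mat: np.ndarray, p: int, eps: float = 1e-12) -> np.ndarray:
    """Naive inverse p-th root: shift eigenvalues by eps, then raise to -1/p."""
    w, v = np.linalg.eigh(mat)
    w = np.maximum(w, 0.0) + eps
    return (v * w ** (-1.0 / p)) @ v.T


def inverse_root_pinv(mat: np.ndarray, p: int, rtol: float = 1e-10) -> np.ndarray:
    """Pseudo-inverse-style root: invert eigenvalues above a relative cutoff,
    and map the (numerically) zero ones to zero instead of a huge value."""
    w, v = np.linalg.eigh(mat)
    cutoff = rtol * max(w.max(), 0.0)
    inv = np.zeros_like(w)
    keep = w > cutoff
    inv[keep] = w[keep] ** (-1.0 / p)
    return (v * inv) @ v.T


# Illustrative rank-deficient statistic: a 6x6 PSD matrix of rank 2.
rng = np.random.default_rng(0)
g = rng.standard_normal((6, 2))
stat = g @ g.T

naive = inverse_root_eps(stat, 4)
stable = inverse_root_pinv(stat, 4)

if __name__ == "__main__":
    print("spectral norm, eps-shifted root:", np.linalg.norm(naive, 2))
    print("spectral norm, pseudo-inverse root:", np.linalg.norm(stable, 2))
```

In the naive version the null directions of the matrix are amplified by roughly eps to the power -1/p, which can swamp the directions that carry real curvature information. The pseudo-inverse variant leaves those directions at zero. Whether this matches the behavior behind the flag discussed in the thread is not something the posts establish.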

Sentiment

Replies lean positive. The main criticisms are that the displayed baselines were not run long enough to check for convergence, and that the exercise yields little insight beyond spending more compute on hyperparameter tuning.

Pos

62.5%

Neg

37.5%

5 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 9.1K · Bookmarks 9 · Likes 38 · Replies 3

Dimitris Papailiopoulos@DimitrisPapail

I have a silly question. How far off is vanilla SGD with well tuned learning rate schedule and batch size? I'd love to see wall clock on the x axis

2h9.1K389

RETWEETS1

Zachary Nado@zacharynado

@konstmish really if we are doing this fairly, each algo should have the same tuning budget, defined by number of hparam points tried.

it implicitly punishes algos with more hparams to tune, but imo that's a fair punishment because in reality practitioners only have a limited tuning budget

Konstantin Mishchenko@konstmish

49m56570

Dimitris Papailiopoulos@DimitrisPapail

wish my runpod credits luck guys

Dimitris Papailiopoulos@DimitrisPapail

I have a silly question. How far off is vanilla SGD with well tuned learning rate schedule and batch size? I'd love to see wall clock on the x axis

2h5.5K151

Lucas Beyer (bl16)@giffmana

With or without momentum? Back in ViT days i trained a large ViT with plain SGDM to within .5% accuracy of the AdamW one. Without momentum i didn't manage to get as close, maybe 4% or so behind iirc. Batch size had to be large indeed. I would expect qualitatively similar thing here.

Dimitris Papailiopoulos@DimitrisPapail

I have a silly question. How far off is vanilla SGD with well tuned learning rate schedule and batch size? I'd love to see wall clock on the x axis

1h1.4K150

Keller Jordan@kellerjordan0

@konstmish Very nice. Would you happen to know how much of the improvement was from tuning the Muon hparams? As opposed to AdamW

Konstantin Mishchenko@konstmish

1h73862

Alex Chaloner@alex_chaloner

@konstmish I think it's reasonable to judge optimizers by how sensitive they are to hyperparameters, although i think muon is generally good in this respect

How much did you have to search to find the better hyperparams?

2h23911

Dimitris Papailiopoulos@DimitrisPapail

@giffmana no momentum

1h1301

Volkan Cevher@CevherLIONS

So, what new insight did we gain from this exercise @konstmish other than spend more compute to tune hyperparameters?

I must admit, I have been trying myself with help from others but I think I will give up since it may take me next year to get a good configuration with my compute.

15m583

Dimitris Papailiopoulos@DimitrisPapail

@konstmish I WILL NOT LOSE HOPE

16m952

Konstantin Mishchenko@konstmish

@DimitrisPapail SGD tends to be quite suboptimal with transformers, it struggles even with minimizing the train loss.

Dimitris Papailiopoulos@DimitrisPapail

I have a silly question. How far off is vanilla SGD with well tuned learning rate schedule and batch size? I'd love to see wall clock on the x axis

17m7410

Konstantin Mishchenko@konstmish

@CevherLIONS There are hundreds of papers claiming they propose optimizers better than AdamW, where the improvement comes from undertuned baseline. All I want is for this not to be the case here. But I'm also a bigger believer in AlgoPerf self-tuning track than modded nanogpt.

10m341

Anirbit@anirbit_maths

@konstmish Did you run the code long enough to check for convergence? None of these displayed baselines are believable because this basic sanity check is not done 🤷‍♂️

2h2012

Konstantin Mishchenko@konstmish

@zacharynado Yes, but then it has to be a lot more than a single problem. Different scales, domains, architectures. Like AlgoPerf.

21m602

Lucas Beyer (bl16)@giffmana

@DimitrisPapail I think that will be hard, maybe with really big batch size. But then, i think since they count steps, batch size must be fixed for the speedrun?

1h841

Dimitris Papailiopoulos@DimitrisPapail

gonna use codex gpt 5.5 xhigh on this and 8xh100. If you don't hear back in 5 hours, it means I bailed.

2h127

Dimitris Papailiopoulos@DimitrisPapail

@giffmana im comparing wallclock, steps is weird with SGD since it uses less flops and memory per step

41m91

Jeffrey Li 💙💛@askerlee

@giffmana @DimitrisPapail Interesting... Was it because the batch size was huge?

39m27

Ferbin@Ferbin08

@DimitrisPapail curious if you saw similar gaps when training on real robot footage. robotics datasets have way more edge cases than typical vision benchmarks. did batch size help there too, or did momentum become even more critical with the noisier data?

58m24

Dimitris Papailiopoulos@DimitrisPapail

@konstmish till i run out of runpod credits

16m18

Konstantin Mishchenko@konstmish

@anirbit_maths It's very far from convergence, which is my biggest complaint about modded nanogpt in general.

9m10