Zachary Nado proposes standardized tuning budgets to prevent optimizers from overfitting to the `modded-nanogpt` benchmark

Muon's steps to target loss fell from 3,325 to 3,250 after hyperparameter tuning

Original post

You Jiacheng@YouJiacheng

😂lol now Shampoo is again in the middle between Muon and Adam. next round, who will tune Shampoo to match / surpass Muon?

Konstantin Mishchenko@konstmish

I just submitted a PR to modded-nanogpt with better hyperparams. With them, Muon can reach the target loss after 3250 steps instead of 3325. Always tune your baseline well when doing research. Weak baselines can make any idea look promising

7:33 AM · Jun 11, 2026 · 4.3K Views

kalomaze@kalomaze

the grid sweep chud cowers at the sight of the "guess ballpark-ish hparams and pray that they work" lion

tokenbender@tokenbender

tune your hparams hard. RNGesus is waiting for you in the loss valley somewhere.

3h ago

Konstantin Mishchenko@konstmish

@kellerjordan0 Codex gave me this estimate:

~77% of the gain came from AdamW hyperparams (make learning rates 1.5 bigger)

~23% came from Muon hyperparams (smaller lr, bigger weight decay)

I didn't touch anything else.

Keller Jordan@kellerjordan0

@konstmish Very nice. Would you happen to know how much of the improvement was from tuning the Muon hparams? As opposed to AdamW

1h ago
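The figures quoted in this thread can be turned into a rough per-optimizer breakdown. The sketch below assumes that "the gain" in the estimate refers to the reduction in steps to target loss (3,325 to 3,250) and that the 77% / 23% split applies linearly to that reduction; both are assumptions made for illustration, not claims from the posts.

```python
# Step counts quoted in the thread.
STEPS_BEFORE = 3325
STEPS_AFTER = 3250

# Estimated attribution quoted in the thread (approximate, as stated there).
ADAMW_SHARE = 0.77
MUON_SHARE = 0.23

steps_saved = STEPS_BEFORE - STEPS_AFTER
relative_gain = steps_saved / STEPS_BEFORE

# Assumption: the shares apply linearly to the number of steps saved.
adamw_steps = ADAMW_SHARE * steps_saved
muon_steps = MUON_SHARE * steps_saved

print(f"steps saved: {steps_saved} ({relative_gain:.2%} of the baseline)")
print(f"attributed to AdamW hparams: ~{adamw_steps:.1f} steps")
print(f"attributed to Muon hparams:  ~{muon_steps:.1f} steps")
```

Under these assumptions the total improvement is a little over 2% of the baseline step count, which gives a sense of how small the differences being debated are.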

Konstantin Mishchenko@konstmish

@zacharynado Yes, but then it has to be a lot more than a single problem. Different scales, domains, architectures. Like AlgoPerf.

Zachary Nado@zacharynado

@konstmish really if we are doing this fairly, each algo should have the same tuning budget, defined by number of hparam points tried.

it implicitly punishes algos with more hparams to tune, but imo that's a fair punishment because in reality practitioners only have a limited tuning budget

4h ago
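The equal-budget idea above, where the budget is the number of hparam points tried, can be sketched with a plain random search. Everything here is hypothetical and for illustration only: the search spaces, the names `two_hparams` and `four_hparams`, and the toy objective, which stands in for an expensive training run that would report something like steps to target loss.

```python
import math
import random


def sample_point(space, rng):
    """Draw one hyperparameter point, log-uniformly within each (low, high) range."""
    return {
        name: math.exp(rng.uniform(math.log(lo), math.log(hi)))
        for name, (lo, hi) in space.items()
    }


def tune(space, evaluate, budget, seed=0):
    """Random search that tries exactly `budget` points and returns the best one."""
    rng = random.Random(seed)
    trials = []
    for _ in range(budget):
        point = sample_point(space, rng)
        trials.append((evaluate(point), point))
    best_score, best_point = min(trials, key=lambda t: t[0])
    return best_score, best_point, len(trials)


def make_toy_objective(optimum):
    """Toy stand-in for a training run: squared log10-distance to a known optimum."""
    def evaluate(point):
        return sum(math.log10(point[k] / optimum[k]) ** 2 for k in optimum)
    return evaluate


# Hypothetical search spaces: one algorithm exposes two hparams, the other four.
SPACES = {
    "two_hparams": {"lr": (1e-4, 1e-1), "weight_decay": (1e-3, 1.0)},
    "four_hparams": {
        "lr": (1e-4, 1e-1),
        "weight_decay": (1e-3, 1.0),
        "beta": (0.5, 0.999),
        "eps": (1e-10, 1e-6),
    },
}

BUDGET = 20  # same number of hparam points for every algorithm

results = {}
for name, space in SPACES.items():
    # Place the toy optimum at the geometric midpoint of each range.
    optimum = {k: math.sqrt(lo * hi) for k, (lo, hi) in space.items()}
    results[name] = tune(space, make_toy_objective(optimum), BUDGET)

for name, (score, point, tried) in results.items():
    print(f"{name}: best score {score:.3f} after {tried} points")
```

With a fixed number of points, a larger search space is covered more sparsely, which is the implicit penalty for algorithms with more hparams that the post describes.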

Zachary Nado@zacharynado

@YouJiacheng we should give each of them the same tuning budget and find out :) at a certain point we'll just be overfitting to this specific benchmark though

4h ago

Puneesh Deora@puneeshdeora

@YouJiacheng We should try these on a bigger scale; we're really overcooking on this problem

4h ago