Tech · 6h ago

Zachary Nado proposes standardized tuning budgets to prevent optimizers from overfitting to the `modded-nanogpt` benchmark

After tuning, Muon reached the target loss in 3,250 steps, down from 3,325

Original post
You Jiacheng@YouJiacheng

😂lol now Shampoo is again in the middle between Muon and Adam. next round, who will tune Shampoo to match / surpass Muon?

I just submitted a PR to modded-nanogpt with better hyperparams. With them, Muon can reach the target loss after 3250 steps instead of 3325. Always tune your baseline well when doing research. Weak baselines can make any idea look promising

7:33 AM · Jun 11, 2026 · 4.3K Views
Posts from X
kalomaze@kalomaze

the grid sweep chud cowers at the sight of the "guess ballpark-ish hparams and pray that they work" lion

tokenbender@tokenbender

tune your hparams hard. RNGesus is waiting for you in the loss valley somewhere.

3h · Views 816 · Likes 14 · Bookmarks 2

@kellerjordan0 Codex gave me this estimate: ~77% of the gain came from AdamW hyperparams (make learning rates 1.5 bigger) ~23% came from Muon hyperparams (smaller lr, bigger weight decay)

I didn't touch anything else.

Keller Jordan@kellerjordan0

@konstmish Very nice. Would you happen to know how much of the improvement was from tuning the Muon hparams? As opposed to AdamW

1h · Views 147 · Likes 8 · Bookmarks 0
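
For scale, the quoted ~77% / ~23% split can be turned into step counts. This is only a rough sketch: it assumes the "gain" means the drop from 3325 to 3250 steps and that the two contributions add up linearly, neither of which the posts confirm.

```python
# Step counts as stated in the original post.
baseline_steps = 3325
tuned_steps = 3250

# Assumption: "gain" refers to the number of steps saved.
steps_saved = baseline_steps - tuned_steps
relative_gain = steps_saved / baseline_steps

# Approximate shares as quoted from the Codex estimate.
adamw_share = 0.77
muon_share = 0.23

# Assumption: the two contributions are additive.
adamw_steps = adamw_share * steps_saved
muon_steps = muon_share * steps_saved

print(f"steps saved: {steps_saved} ({relative_gain:.2%} of baseline)")
print(f"attributed to AdamW hparams: ~{adamw_steps:.1f} steps")
print(f"attributed to Muon hparams:  ~{muon_steps:.1f} steps")
```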

@zacharynado Yes, but then it has to be a lot more than a single problem. Different scales, domains, architectures. Like AlgoPerf.

Zachary Nado@zacharynado

@konstmish really if we are doing this fairly, each algo should have the same tuning budget, defined by number of hparam points tried.

it implicitly punishes algos with more hparams to tune, but imo that's a fair punishment because in reality practitioners only have a limited tuning budget

4h · Views 210 · Likes 4 · Bookmarks 0
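
A minimal sketch of what "same tuning budget, defined by number of hparam points tried" could look like in practice. Everything here is hypothetical: the optimizer names, the search spaces, the budget of 20, and the toy objective standing in for a real training run. It uses plain random search, which is one possible tuning method, not necessarily what anyone in the thread has in mind.

```python
import math
import random

BUDGET = 20  # hypothetical: identical number of hparam points for every optimizer

# Hypothetical search spaces given as log10 ranges; names and ranges are illustrative only.
SPACES = {
    "optimizer_a": {"lr": (-4, -1), "weight_decay": (-3, 0)},
    "optimizer_b": {"lr": (-4, -1), "weight_decay": (-3, 0),
                    "beta2": (-3, -1), "eps": (-10, -6)},
}

def toy_steps_to_target(cfg, space):
    """Stand-in for a real training run: fewest steps at the midpoint of each log range."""
    penalty = 0.0
    for name, value in cfg.items():
        lo, hi = space[name]
        mid = (lo + hi) / 2
        penalty += (math.log10(value) - mid) ** 2
    return 3000 + 100 * penalty

def tune(space, budget, seed=0):
    """Random search capped at `budget` trials; returns (best steps, best config, trials used)."""
    rng = random.Random(seed)
    trials = []
    for _ in range(budget):
        # Sample every hyperparameter log-uniformly from its range.
        cfg = {name: 10 ** rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        trials.append((toy_steps_to_target(cfg, space), cfg))
    best_steps, best_cfg = min(trials, key=lambda t: t[0])
    return best_steps, best_cfg, len(trials)

results = {name: tune(space, BUDGET) for name, space in SPACES.items()}
for name, (steps, cfg, n_trials) in results.items():
    print(f"{name}: best {steps:.0f} steps after {n_trials} trials")
```

Because both optimizers get exactly `BUDGET` points, the one with four hyperparameters covers its larger search space more sparsely, which is the implicit penalty described in the post.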
Zachary Nado@zacharynado

@YouJiacheng we should give each of them the same tuning budget and find out :) at a certain point we'll just be overfitting to this specific benchmark though

4h · Views 659 · Likes 0 · Bookmarks 0
Puneesh Deora@puneeshdeora

@YouJiacheng We should try these on a bigger scale we're really overcooking on this problem

4h · Views 5