
Muon Optimizer Crosses 3.28 Val Loss Target on NanoGPT in 574s


Original post

Dimitris Papailiopoulos@DimitrisPapail

[+574.3s train_time] Muon first crossed the practical 3.28 target: step 3300, val_loss=3.27976.

[+578.3s train_time] Published/current target region: step 3325, val_loss=3.27855.

[+582.4s train_time] Final observed Muon point: step 3350, val_loss=3.27796.
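The three log points above can be checked mechanically. What follows is a minimal sketch, not the benchmark's own tooling: the regex is an assumption about the line format quoted in this post (not the modded-nanogpt logger format), the helper names `parse`, `first_crossing`, and `seconds_per_step` are hypothetical, and "crossed" is read as val_loss <= target.

```python
import re

# The three log lines quoted in the post above.
LOG = """\
[+574.3s train_time] Muon first crossed the practical 3.28 target: step 3300, val_loss=3.27976.
[+578.3s train_time] Published/current target region: step 3325, val_loss=3.27855.
[+582.4s train_time] Final observed Muon point: step 3350, val_loss=3.27796.
"""

# Assumed pattern for the lines as quoted here; a real log may differ.
LINE = re.compile(
    r"\[\+(\d+(?:\.\d+)?)s train_time\].*?step (\d+), val_loss=(\d+\.\d+)"
)


def parse(log):
    """Return (train_time_s, step, val_loss) tuples in log order."""
    return [(float(t), int(s), float(v)) for t, s, v in LINE.findall(log)]


def first_crossing(points, target):
    """First logged point whose val_loss is at or below target, else None."""
    for point in points:
        if point[2] <= target:
            return point
    return None


def seconds_per_step(points):
    """Average wall-clock seconds per step between first and last point."""
    (t0, s0, _), (t1, s1, _) = points[0], points[-1]
    return (t1 - t0) / (s1 - s0)


if __name__ == "__main__":
    pts = parse(LOG)
    print(first_crossing(pts, 3.28))
    print(round(seconds_per_step(pts), 3))
```

On these three points the late-run pace works out to 8.1 s over 50 steps, roughly 0.16 s per step.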

Dimitris Papailiopoulos@DimitrisPapail

updates from Codex:

- We got onto the RunPod 8×H100 node and cloned KellerJordan/modded-nanogpt.

- Hardware is healthy: 8× H100 80GB visible. FineWeb data download completed at 10:32:58 AM CT.

- Muon baseline launched at 10:44:39 AM CT.

- GPU util was confirmed maxed: all 8 GPUs at 100%, ~36GB memory each, ~620-690W.

- Muon reached val_loss=4.12937 at 81.6s train time around step 250.

- No SGD results yet.

9:07 AM · Jun 11, 2026 · 425 Views


Dimitris Papailiopoulos@DimitrisPapail

[SGD first probe] The first pure no-state SGD run used batch 524k, no Adam/Muon, no momentum, but LR was too high: validation loss stayed around 11, so it was cut early.

[11:02:03 AM CT] Lower-LR SGD probe idea_008 was launched in the parent run: batch 524k, no weight decay, 500-step screen.


Dimitris Papailiopoulos@DimitrisPapail

@konstmish I WILL NOT LOSE HOPE

Dimitris Papailiopoulos@DimitrisPapail

not looking good for minimalists, but I ain't losing hope


Dimitris Papailiopoulos@DimitrisPapail

@konstmish till i run out of runpod credits


Dimitris Papailiopoulos@DimitrisPapail

[11:04 AM CT] idea_008 was also bad: val_loss rose to ~14.8 by ~475 steps, so we learned the raw SGD LR scale was still wrong.

[Key diagnosis] The benchmark loss uses reduction="sum" and gradients are summed across ranks. For vanilla SGD, the LR must therefore be orders of magnitude smaller than Adam/Muon-style LR numbers.
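A toy illustration of that diagnosis, as a sketch under stated assumptions: a one-parameter least-squares problem stands in for the real model, and the names `grad`, `sgd_step`, and `equivalent_sum_lr` are hypothetical. It shows only the scaling argument: a sum-reduced gradient is N times the mean-reduced one, and a plain SGD step is lr times the gradient, so the LR has to shrink by N to take the same step. Adam divides its update by a running gradient magnitude, which is why its LR numbers are far less sensitive to this factor.

```python
def grad(w, xs, ys, reduction):
    """Gradient of 0.5 * (w*x - y)**2 over a batch, sum- or mean-reduced."""
    total = sum((w * x - y) * x for x, y in zip(xs, ys))
    return total if reduction == "sum" else total / len(xs)


def sgd_step(w, xs, ys, lr, reduction):
    """One vanilla SGD step: no momentum, no optimizer state."""
    return w - lr * grad(w, xs, ys, reduction)


def equivalent_sum_lr(mean_lr, n_summed):
    """LR that makes a sum-reduced step match a mean-reduced step."""
    return mean_lr / n_summed


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
    mean_lr = 0.1
    w_mean = sgd_step(0.0, xs, ys, mean_lr, "mean")
    w_sum = sgd_step(0.0, xs, ys, equivalent_sum_lr(mean_lr, len(xs)), "sum")
    print(w_mean, w_sum)  # the two steps should agree up to rounding
```

If the sum runs over every token in a 524k-token batch (an assumption about the benchmark, not something the thread states), N is about 5e5, so the usable SGD LR lands more than five orders of magnitude below a mean-reduction LR.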

[~11:07 AM CT] New log-scale pure-SGD probes were generated: no Adam/Muon/momentum, 125-step screens, validation every 25 steps, LR range roughly 1e-8 to 1e-6.

[11:08-11:09 AM CT] idea_020 with LR 1e-7 was stable but slow: step 125 val_loss=8.17933, train_time 55.9s.

[11:09-11:10 AM CT] idea_021 with LR 3e-7 improved: step 125 val_loss=7.73179, train_time 22.4s.

[11:11-11:12 AM CT] idea_022 with LR 1e-6 improved further: step 125 val_loss=7.37449, train_time 22.4s.

[Current read] SGD is faster per steady step and uses less optimizer memory, but it is still far behind Muon in loss at the same early step count: Muon step 125 was 4.67992; best SGD probe so far is 7.37449.

[Next useful experiment] Let the best stable SGD setting run to 500 or 1000 steps, then try larger microbatch/batch settings that use SGD’s lower optimizer memory, while keeping GPU util near 100%.
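The probe sweep in this thread can be summarized in a few lines. This is a sketch, not the actual sweep generator: `log_grid` and `best_probe` are hypothetical helpers, and the 1-3-10 spacing is an assumption inferred from the reported LRs (1e-7, 3e-7, 1e-6). The loss values are the step-125 numbers quoted above.

```python
def log_grid(lo_exp, hi_exp, mantissas=(1, 3)):
    """1-3-10 style learning rates from 10**lo_exp up to 10**hi_exp."""
    top = 10.0 ** hi_exp
    lrs = []
    for exp in range(lo_exp, hi_exp + 1):
        for m in mantissas:
            lr = m * 10.0 ** exp
            if lr <= top:  # drop 3 * 10**hi_exp, which overshoots the range
                lrs.append(lr)
    return lrs


# Step-125 results reported in the thread: probe name -> (lr, val_loss).
PROBES = {
    "idea_020": (1e-7, 8.17933),
    "idea_021": (3e-7, 7.73179),
    "idea_022": (1e-6, 7.37449),
}
MUON_STEP_125 = 4.67992  # Muon val_loss at the same step, from the thread


def best_probe(probes):
    """Return (name, (lr, val_loss)) for the lowest val_loss."""
    return min(probes.items(), key=lambda kv: kv[1][1])


if __name__ == "__main__":
    print([f"{lr:.0e}" for lr in log_grid(-8, -6)])
    name, (lr, loss) = best_probe(PROBES)
    print(name, lr, round(loss - MUON_STEP_125, 5))
```

Loss falls monotonically with LR across the three reported probes and the best one sits at the top of the range, so this sweep has not yet bracketed the best SGD LR.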
