/Tech · 2h ago

Microsoft Research's Dimitris Papailiopoulos is benchmarking tuned SGD against Muon to see if SGD can close the validation loss gap

AI Judge changed title after evaluation, original title: "Dimitris Papailiopoulos benchmarks vanilla SGD against advanced optimizers, drawing critique from Lucas Beyer over speedrun batch size constraints"

The five-hour benchmark runs on eight H100 GPUs.


Original post

Dimitris Papailiopoulos@DimitrisPapail#203inTech

gonna use codex gpt 5.5 xhigh on this and 8xh100. If you don't hear back in 5 hours, it means I bailed.

Dimitris Papailiopoulos@DimitrisPapail

wish my runpod credits luck guys

8:24 AM · Jun 11, 2026 · 603 Views



Cluster Engagement

Posts from X


Dimitris Papailiopoulos@DimitrisPapail

the burn has started. i bet you nothing good will come out of it, but it's entertaining


Dimitris Papailiopoulos@DimitrisPapail

@_arohan_ are you offering me credits? 🥲 i do not accept payment in optimizers

rohan anil@_arohan_

@DimitrisPapail I can help with this.


Dimitris Papailiopoulos@DimitrisPapail

updates from Codex:

- We got onto the RunPod 8×H100 node and cloned KellerJordan/modded-nanogpt.

- Hardware is healthy: 8× H100 80GB visible. FineWeb data download completed at 10:32:58 AM CT.

- Muon baseline launched at 10:44:39 AM CT

- GPU util was confirmed maxed: all 8 GPUs at 100%, ~36GB memory each, ~620-690W.

- Muon reached val_loss=4.12937 at 81.6s train time around step 250.

- No SGD results yet.


Dimitris Papailiopoulos@DimitrisPapail

my father in law is preparing Ceviche and he just passed me a glass of sparkling water. Wait it was actually sparkling wine. Expect the speedrun to slow down.

Dimitris Papailiopoulos@DimitrisPapail

[11:13-11:20 AM CT] Best uniform pure-SGD run so far was idea_027: LR `1e-6`, batch `524k`, no momentum/Adam/Muon. It reached val_loss `6.79449` at step 500, train_time `82.1s`.

[11:20-11:25 AM CT] Using SGD’s memory headroom to fit a 2× batch kept the GPUs full, but it did not help: idea_028 reached val_loss `7.06098` at step 250, train_time `112.2s`. More batch was worse per second, not better.
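A quick arithmetic check of the batch-size claim, as a sketch only. It assumes that "524k" means 524,288 tokens per step and that "2× batch" simply doubles it; neither is stated in the posts. Under those assumptions idea_028 at step 250 had seen the same number of tokens as idea_027 at step 500, yet took longer and ended at a higher loss. The reported train times may include one-off overheads, so the throughput figures are indicative at best.

```python
# Figures quoted in the posts. TOKENS_PER_BATCH is an assumption:
# "524k" is read here as 524,288 tokens per optimizer step.
TOKENS_PER_BATCH = 524_288

# name: (batch multiplier, steps, train_time_s, val_loss)
runs = {
    "idea_027": (1, 500, 82.1, 6.79449),
    "idea_028": (2, 250, 112.2, 7.06098),  # multiplier 2 is an assumption
}


def tokens_seen(batch_mult, steps):
    """Total tokens consumed after `steps` optimizer steps."""
    return batch_mult * TOKENS_PER_BATCH * steps


def tokens_per_second(batch_mult, steps, train_time_s):
    """Average training throughput over the reported train time."""
    return tokens_seen(batch_mult, steps) / train_time_s


for name, (mult, steps, secs, loss) in runs.items():
    print(
        f"{name}: {tokens_seen(mult, steps):,} tokens, "
        f"{tokens_per_second(mult, steps, secs):,.0f} tok/s, val_loss {loss}"
    )
```

Read this way, the doubled batch processed the same data more slowly and reached a worse loss, which is consistent with the "worse per second" remark.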

[11:30-11:45 AM CT] Layerwise LR scaling was tested across embeddings/head/scalars/attention/MLP and actual depth scaling. It helps a little, but not enough. Best 125-step screen was idea_044 at val_loss `7.20629`, better than uniform SGD’s `7.37` at the same point.

[11:45-11:52 AM CT] Longer best layerwise run idea_055: embed/scalar `1e-6`, head `3e-6`, matrices `5e-7`, batch `524k`. It reached val_loss `6.37802` at step 1000, train_time `161.6s` / real time `197.9s`.
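The layerwise setting of idea_055 can be pictured as plain SGD with one learning rate per parameter group. The sketch below uses the learning rates quoted in the post; the parameter names and the grouping rule (`group_of`) are hypothetical and are not taken from modded-nanogpt. In PyTorch the same idea would usually be expressed through per-group `lr` values passed to `torch.optim.SGD`.

```python
# Learning rates as quoted for idea_055.
LAYERWISE_LR = {"embed": 1e-6, "scalar": 1e-6, "head": 3e-6, "matrix": 5e-7}


def group_of(name, ndim):
    """Hypothetical grouping rule based on parameter name and rank."""
    if "embed" in name:
        return "embed"
    if "lm_head" in name:
        return "head"
    if ndim < 2:
        return "scalar"
    return "matrix"


def sgd_step(params, grads, ndims):
    """Plain SGD (no momentum, no weight decay) with one LR per group."""
    for name, values in params.items():
        lr = LAYERWISE_LR[group_of(name, ndims[name])]
        for i, g in enumerate(grads[name]):
            values[i] -= lr * g


# Toy parameters stored as flat lists; all names are illustrative.
params = {
    "embed.weight": [1.0],
    "lm_head.weight": [1.0],
    "blocks.0.mlp.weight": [1.0],
    "blocks.0.gain": [1.0],
}
ndims = {
    "embed.weight": 2,
    "lm_head.weight": 2,
    "blocks.0.mlp.weight": 2,
    "blocks.0.gain": 1,
}
grads = {name: [1.0] for name in params}

sgd_step(params, grads, ndims)
for name, values in params.items():
    print(name, group_of(name, ndims[name]), values[0])
```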

[Current read] SGD is not losing because of GPU utilization or obvious wall-clock overhead. The steady step time is competitive and memory is lower, but optimizer dynamics are massively behind Muon. At roughly comparable early wall-clock, Muon is already around `3.8` or better while best clean SGD is still around `6.4-6.8`.

[Next useful experiment] The only still-plausible cheap knobs are small layerwise refinements around idea_044/055 and maybe coupled classical L2. I would not spend much more budget on bigger batch unless the goal is specifically to document that SGD memory headroom does not translate into faster loss reduction here.
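To make the wall-clock comparison concrete, the sketch below tabulates only the checkpoints that the posts report with a train time, then asks for each optimizer's best reported loss within a time budget. The helper `best_loss_within` and the run labels are illustrative; the posts give just one timed Muon checkpoint, so this is a coarse comparison and not a full loss curve.

```python
# (label, step, train_time_s, val_loss), all taken from the posts.
checkpoints = [
    ("muon_baseline", 250, 81.6, 4.12937),  # "around step 250"
    ("sgd_idea_027", 500, 82.1, 6.79449),
    ("sgd_idea_055", 1000, 161.6, 6.37802),
]


def best_loss_within(prefix, budget_s):
    """Lowest reported val_loss for runs matching `prefix` within the budget."""
    losses = [
        loss
        for label, _step, secs, loss in checkpoints
        if label.startswith(prefix) and secs <= budget_s
    ]
    return min(losses) if losses else None


for budget in (85.0, 165.0):
    muon = best_loss_within("muon", budget)
    sgd = best_loss_within("sgd", budget)
    print(f"budget {budget:>5.0f}s: muon={muon} sgd={sgd} gap={sgd - muon:.5f}")
```

On these numbers, SGD after roughly 162 seconds is still well above where Muon was after roughly 82 seconds.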


Dimitris Papailiopoulos@DimitrisPapail

not looking good for minimalists, but I aint losing hope


Lucas Beyer (bl16)@giffmana

@DimitrisPapail I think that will be hard, maybe with really big batch size. But then, i think since they count steps, batch size must be fixed for the speedrun?

Dimitris Papailiopoulos@DimitrisPapail

@giffmana no momentum


Dimitris Papailiopoulos@DimitrisPapail

[11:04 AM CT] idea_008 was also bad: val_loss rose to ~14.8 by ~475 steps, so we learned the raw SGD LR scale was still wrong.

[Key diagnosis] The benchmark loss uses reduction="sum" and gradients are summed across ranks. For vanilla SGD, LR must be orders of magnitude smaller than Adam/Muon-style LR numbers.
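Why a summed loss forces a tiny SGD learning rate can be seen on a toy one-parameter problem. The sketch below is not the benchmark's code: the squared-error model, the token count per rank (524,288 split over 8 ranks) and the mean-loss learning rate are all assumptions chosen for illustration. It shows that a summed gradient is larger than an averaged one by the number of tokens per step, so a raw SGD step needs its LR divided by that factor, while a magnitude-normalized update (used here as a crude stand-in for Adam/Muon-style normalization) is unaffected.

```python
TOKENS_PER_RANK = 65_536  # assumption: 524,288 tokens split over 8 ranks
WORLD_SIZE = 8


def per_token_grad(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2.0 * (w * x - y) * x


def rank_grad(w, xs, ys, reduction):
    """Gradient on one rank under reduction='sum' or reduction='mean'."""
    total = sum(per_token_grad(w, x, y) for x, y in zip(xs, ys))
    return total if reduction == "sum" else total / len(xs)


def normalized_step(lr, g):
    """Update whose size ignores gradient magnitude (stand-in, not Adam itself)."""
    return lr * (g / abs(g))


# Identical toy data on every rank keeps the arithmetic exact.
xs = [1.0] * TOKENS_PER_RANK
ys = [0.0] * TOKENS_PER_RANK
w = 0.5

# Summed loss with gradients summed across ranks.
g_sum = sum(rank_grad(w, xs, ys, "sum") for _ in range(WORLD_SIZE))
# Mean loss with gradients averaged across ranks.
g_mean = sum(rank_grad(w, xs, ys, "mean") for _ in range(WORLD_SIZE)) / WORLD_SIZE

scale = g_sum / g_mean   # tokens per optimizer step
lr_mean = 0.5            # hypothetical LR that suits the mean-reduced loss
lr_sum = lr_mean / scale  # LR giving the same raw SGD step under sum reduction

print("gradient scale factor:", scale)
print("equivalent SGD LR under sum reduction:", lr_sum)
print("normalized steps:", normalized_step(0.02, g_sum), normalized_step(0.02, g_mean))
```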

[~11:07 AM CT] New log-scale pure-SGD probes were generated: no Adam/Muon/momentum, 125-step screens, validation every 25 steps, LR range roughly 1e-8 to 1e-6.
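A log-scale sweep over the stated range can be generated as below. This is a minimal sketch: the helper `log_grid` and the choice of two points per decade are assumptions, picked because the probes reported in the thread (1e-7, 3e-7, 1e-6) are spaced roughly half a decade apart.

```python
import math


def log_grid(lo, hi, points_per_decade):
    """Learning rates spaced evenly in log10 between lo and hi, inclusive."""
    n = round(math.log10(hi / lo) * points_per_decade)
    return [lo * 10 ** (i / points_per_decade) for i in range(n + 1)]


grid = log_grid(1e-8, 1e-6, points_per_decade=2)
for lr in grid:
    print(f"{lr:.2e}")
```

Each candidate would then get a short screen (125 steps in the thread) before the best one is run for longer.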

[11:08-11:09 AM CT] idea_020 with LR 1e-7 was stable but slow: step 125 val_loss=8.17933, train_time 55.9s.

[11:09-11:10 AM CT] idea_021 with LR 3e-7 improved: step 125 val_loss=7.73179, train_time 22.4s.

[11:11-11:12 AM CT] idea_022 with LR 1e-6 improved further: step 125 val_loss=7.37449, train_time 22.4s.

[Current read] SGD is faster per steady step and uses less optimizer memory, but it is still far behind Muon in loss at the same early step count: Muon step 125 was 4.67992; best SGD probe so far is 7.37449.

[Next useful experiment] Let the best stable SGD setting run to 500 or 1000 steps, then try larger microbatch/batch settings that use SGD’s lower optimizer memory, while keeping GPU util near 100%.

Dimitris Papailiopoulos@DimitrisPapail

[SGD first probe] The first pure no-state SGD run used batch 524k, no Adam/Muon, no momentum, but LR was too high: validation loss stayed around 11, so it was cut early.

[11:02:03 AM CT] Lower-LR SGD probe idea_008 was launched in the parent run: batch 524k, no weight decay, 500-step screen.




Dimitris Papailiopoulos@DimitrisPapail

@giffmana im comparing wallclock, steps is weird with SGD since it uses less flops and memory per step
