gonna use codex gpt 5.5 xhigh on this and 8xh100. If you don't hear back in 5 hours, it means I bailed.
wish my runpod credits luck guys
AI Judge changed title after evaluation, original title: "Dimitris Papailiopoulos benchmarks vanilla SGD against advanced optimizers, drawing critique from Lucas Beyer over speedrun batch size constraints"
The five-hour benchmark runs on eight H100 GPUs.
the burn has started. i bet you nothing good will come out of it, but it's entertaining
@_arohan_ are you offering me credits? 🥲 i do not accept payment in optimizers
@DimitrisPapail I can help with this.
updates from Codex:
- We got onto the RunPod 8×H100 node and cloned KellerJordan/modded-nanogpt.
- Hardware is healthy: 8× H100 80GB visible. FineWeb data download completed at 10:32:58 AM CT.
- Muon baseline launched at 10:44:39 AM CT
- GPU util was confirmed maxed: all 8 GPUs at 100%, ~36GB memory each, ~620-690W.
- Muon reached val_loss=4.12937 at 81.6s train time around step 250.
- No SGD results yet.
my father in law is preparing Ceviche and he just passed me a glass of sparkling water. Wait it was actually sparkling wine. Expect the speedrun to slow down.
[11:13-11:20 AM CT] Best uniform pure-SGD run so far was idea_027: LR `1e-6`, batch `524k`, no momentum/Adam/Muon. It reached val_loss `6.79449` at step 500, train_time `82.1s`.
[11:20-11:25 AM CT] A 2× batch fit within SGD’s memory headroom and kept the GPUs full, but it did not help: idea_028 reached val_loss `7.06098` at step 250, train_time `112.2s`. More batch was worse per second, not better.
[11:30-11:45 AM CT] Layerwise LR scaling was tested across embeddings/head/scalars/attention/MLP and actual depth scaling. It helps a little, but not enough. Best 125-step screen was idea_044 at val_loss `7.20629`, better than uniform SGD’s `7.37` at the same point.
[11:45-11:52 AM CT] Longer best layerwise run idea_055: embed/scalar `1e-6`, head `3e-6`, matrices `5e-7`, batch `524k`. It reached val_loss `6.37802` at step 1000, train_time `161.6s` / real time `197.9s`.
[Current read] SGD is not losing because of GPU utilization or obvious wall-clock overhead. The steady step time is competitive and memory is lower, but optimizer dynamics are massively behind Muon. At roughly comparable early wall-clock, Muon is already around `3.8` or better while best clean SGD is still around `6.4-6.8`.
[Next useful experiment] The only still-plausible cheap knobs are small layerwise refinements around idea_044/055 and maybe coupled classical L2. I would not spend much more budget on bigger batch unless the goal is specifically to document that SGD memory headroom does not translate into faster loss reduction here.
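A minimal sketch of what the layerwise LR setup in idea_055 could look like, assuming parameters are bucketed by name and rank. The parameter names (`embed.weight`, `lm_head.weight`, and so on) and the bucketing rule are hypothetical and not taken from the modded-nanogpt code; only the four learning rates come from the log above.

```python
# Learning rates reported for idea_055 in the log above.
GROUP_LR = {"embed": 1e-6, "scalar": 1e-6, "head": 3e-6, "matrix": 5e-7}


def group_of(name, ndim):
    """Pick an LR group from a parameter name and rank (assumed rule)."""
    if "embed" in name:
        return "embed"
    if "lm_head" in name:
        return "head"
    if ndim < 2:
        return "scalar"
    return "matrix"


def sgd_step(params, grads, ndims):
    """Plain SGD, p <- p - lr * g: no momentum and no optimizer state."""
    for name, values in params.items():
        lr = GROUP_LR[group_of(name, ndims[name])]
        for i, g in enumerate(grads[name]):
            values[i] -= lr * g


# Toy one-element "tensors" with hypothetical names, for illustration only.
params = {
    "embed.weight": [0.0],
    "lm_head.weight": [0.0],
    "blocks.0.attn.qkv": [0.0],
    "blocks.0.lambda": [0.0],
}
ndims = {
    "embed.weight": 2,
    "lm_head.weight": 2,
    "blocks.0.attn.qkv": 2,
    "blocks.0.lambda": 0,
}
grads = {name: [1.0] for name in params}

sgd_step(params, grads, ndims)
for name, values in params.items():
    print(name, values[0])
```

With a unit gradient everywhere, each parameter moves by exactly its group's learning rate, which makes the per-group scaling easy to eyeball. In a PyTorch training script the same idea would normally be expressed as optimizer parameter groups rather than a hand-written loop.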
not looking good for minimalists, but I aint losing hope
@DimitrisPapail I think that will be hard, maybe with really big batch size. But then, i think since they count steps, batch size must be fixed for the speedrun?
@giffmana no momentum
[11:04 AM CT] idea_008 was also bad: val_loss rose to ~14.8 by ~475 steps, so we learned the raw SGD LR scale was still wrong.
[Key diagnosis] The benchmark loss uses reduction="sum" and gradients are summed across ranks. For vanilla SGD, LR must be orders of magnitude smaller than Adam/Muon-style LR numbers.
[~11:07 AM CT] New log-scale pure-SGD probes were generated: no Adam/Muon/momentum, 125-step screens, validation every 25 steps, LR range roughly 1e-8 to 1e-6.
[11:08-11:09 AM CT] idea_020 with LR 1e-7 was stable but slow: step 125 val_loss=8.17933, train_time 55.9s.
[11:09-11:10 AM CT] idea_021 with LR 3e-7 improved: step 125 val_loss=7.73179, train_time 22.4s.
[11:11-11:12 AM CT] idea_022 with LR 1e-6 improved further: step 125 val_loss=7.37449, train_time 22.4s.
[Current read] SGD is faster per steady step and uses less optimizer memory, but it is still far behind Muon in loss at the same early step count: Muon step 125 was 4.67992; best SGD probe so far is 7.37449.
[Next useful experiment] Let the best stable SGD setting run to 500 or 1000 steps, then try larger microbatch/batch settings that use SGD’s lower optimizer memory, while keeping GPU util near 100%.
[SGD first probe] The first pure no-state SGD run used batch 524k, no Adam/Muon, no momentum, but LR was too high: validation loss stayed around 11, so it was cut early.
[11:02:03 AM CT] Lower-LR SGD probe idea_008 was launched in the parent run: batch 524k, no weight decay, 500-step screen.
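The key diagnosis above can be illustrated with a toy example: the gradient of a summed loss is N times the gradient of the mean loss, so plain SGD needs an LR roughly N times smaller to take the same step. This is a rough sketch on scalar least squares, not the benchmark's code. It assumes that "524k" means 2**19 tokens and that the sum runs over every token in the global batch; neither detail is confirmed by the thread.

```python
def grad_squared_error(w, xs, ys, reduction):
    """d/dw of the squared error (w*x - y)^2, reduced by 'sum' or 'mean'."""
    total = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))
    return total if reduction == "sum" else total / len(xs)


def mean_equivalent_lr(sum_lr, n_items):
    """LR on a mean-reduced loss that gives the same SGD step as sum_lr."""
    return sum_lr * n_items


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
g_sum = grad_squared_error(0.0, xs, ys, "sum")
g_mean = grad_squared_error(0.0, xs, ys, "mean")
print(g_sum / g_mean)  # ratio equals the number of examples (4 here)

# Assumption: "524k" is 2**19 tokens, all of them inside the summed loss.
TOKENS_PER_STEP = 2 ** 19
print(mean_equivalent_lr(1e-6, TOKENS_PER_STEP))
```

Under those assumptions, the SGD LR of 1e-6 from the probes corresponds to a mean-reduced LR of about 0.5, which is why the raw number looks so tiny next to Adam/Muon-style values.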
@giffmana im comparing wallclock, steps is weird with SGD since it uses less flops and memory per step
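For anyone who wants to check the wallclock vs. steps point, here is a small sketch that turns the numbers from the Codex logs into average seconds per step. The dictionary keys are made-up labels, and the averages are just train_time divided by step count, so they include whatever warmup the logged train_time covers.

```python
# (step, train_time in seconds, val_loss) as reported in the Codex updates.
runs = {
    "muon_baseline": (250, 81.6, 4.12937),
    "sgd_idea_027_uniform": (500, 82.1, 6.79449),
    "sgd_idea_028_2x_batch": (250, 112.2, 7.06098),
    "sgd_idea_055_layerwise": (1000, 161.6, 6.37802),
}


def seconds_per_step(run):
    """Average train time per step up to the logged checkpoint."""
    step, train_time_s, _val_loss = run
    return train_time_s / step


for name, run in runs.items():
    print(f"{name}: {seconds_per_step(run):.3f} s/step, val_loss {run[2]}")
```

On these numbers uniform SGD gets through about twice as many steps as Muon in the same ~82 seconds, yet Muon's loss at that point is far lower, which is the gap the wallclock comparison is meant to expose.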