
Microsoft’s Dimitris Papailiopoulos details an unconventional heavy-ball SGD variant, which he calls "wave SGD" and a reply dubs "ValiantSGD", that divides updates by 524,288

The training loop resets momentum buffers during cyclic checkpointing


Original post

Dimitris Papailiopoulos @DimitrisPapail · #217 in Tech

btw this is a weird heavy ball SGD variant that basically does this:

1. Load previous checkpoint weights.
2. Reset optimizer state / momentum buffers.
3. Train N steps.
4. For first M steps: warm LR from 0.1x -> 1.0x.
5. Hold LR flat until ~50% of the cycle.
6. Linearly decay LR to zero.
7. Save checkpoint.
8. Repeat.
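The per-cycle learning-rate schedule in steps 4 to 6 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the author's code: the function name `wave_lr`, its parameter names, the linear shape of the warm-up, and the exact 50% hold boundary (the post says "~50%") are all hypothetical choices.

```python
def wave_lr(step, cycle_steps, warmup_steps, peak_lr,
            hold_frac=0.5, warmup_start=0.1):
    """Learning rate at `step` within one cycle (0 <= step <= cycle_steps).

    Assumed shape: linear warm-up from warmup_start * peak_lr to peak_lr,
    flat until hold_frac of the cycle, then linear decay to zero.
    """
    hold_end = int(hold_frac * cycle_steps)
    if step < warmup_steps:
        # Step 4: ramp from 0.1x to 1.0x (linear ramp is an assumption).
        frac = step / max(1, warmup_steps)
        return peak_lr * (warmup_start + (1.0 - warmup_start) * frac)
    if step < hold_end:
        # Step 5: hold flat until roughly half of the cycle.
        return peak_lr
    # Step 6: linear decay that reaches zero at the end of the cycle.
    remaining = cycle_steps - step
    return peak_lr * max(0.0, remaining / max(1, cycle_steps - hold_end))
```

Calling this with `step` reset to 0 at the start of every cycle (after reloading the checkpoint and clearing the momentum buffers) would reproduce the repeating ramp, hold, decay pattern the post describes.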

the optimizer is exactly this:

    buf = mu * buf + grad
    p *= 1 - lr * wd
    p -= lr * buf / 524288
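For readers who want to try the three-line update, here is a minimal pure-Python sketch that applies it to flat lists of floats. The function names `heavy_ball_step` and `fresh_buffers` and the list-based representation are assumptions made for illustration; only the three update lines and the constant 524288 (which equals 2**19) come from the post.

```python
SCALE = 524288  # divisor quoted in the post; equals 2**19


def fresh_buffers(params):
    """Zeroed momentum buffers, as used after the per-cycle optimizer reset."""
    return [0.0 for _ in params]


def heavy_ball_step(params, grads, bufs, lr, mu, wd):
    """Apply one update in place to parallel lists of floats."""
    for i in range(len(params)):
        # buf = mu * buf + grad  (heavy-ball momentum accumulation)
        bufs[i] = mu * bufs[i] + grads[i]
        # p *= 1 - lr * wd       (decoupled weight decay)
        params[i] *= 1.0 - lr * wd
        # p -= lr * buf / 524288 (scaled parameter update)
        params[i] -= lr * bufs[i] / SCALE
```

A usage example: starting from `params = [1.0]` and `bufs = fresh_buffers(params)`, calling `heavy_ball_step(params, [524288.0], bufs, lr=0.1, mu=0.9, wd=0.0)` moves the parameter to about 0.9, since the gradient is divided back down by `SCALE` before the step is taken.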

@CevherLIONS @_arohan_ does this have a name?

I'd call it wave SGD lol

Dimitris Papailiopoulos @DimitrisPapail

BOOM shakalaka

now shortening the timeline :D

9:41 AM · Jun 12, 2026 · 1.6K Views

Sentiment

Users praise the Wave SGD optimizer's ramp-cruise-anneal schedule and checkpoint cycling because it avoids unnecessary warm-ups while shaving log factors in convergence.

Positive: 100.0% · Negative: 0.0% (1 comment with sentiment)

Cluster Engagement

Posts from X

Most Activity

Views: 1K · Likes: 6 · Replies: 2

Dimitris Papailiopoulos @DimitrisPapail

basically it's doing this again and again


Bookmarks: 2 · Retweets: 1

Lucas Beyer (bl16) @giffmana

@DimitrisPapail Maybe you are too young but this was absolutely a thing :D https://arxiv.org/abs/1608.03983 https://arxiv.org/abs/1506.01186 https://arxiv.org/abs/2008.01171

Dimitris Papailiopoulos @DimitrisPapail

@giffmana it also uses a very weird schedule. calling it wave SGD for now :D


rohan anil @_arohan_

@DimitrisPapail ValiantSGD
