Tech · 10h ago

Meta's Lucas Beyer questions whether the "Stochastic Grandpa" SGD baseline relies on momentum to match modern optimizers

Dimitris Papailiopoulos replies that the baseline uses heavy-ball momentum and an unusual learning-rate schedule.


Original post

Lucas Beyer (bl16) @giffmana

@DimitrisPapail This is with momentum or pure?

Dimitris Papailiopoulos @DimitrisPapail

It is looking very very good for Stochastic Grandpa

9:57 AM · Jun 12, 2026 · 1.4K Views

Sentiment

One commenter, Lucas Beyer, describes SGD with momentum (SGDM) as a bigger improvement over plain SGD than Adam is over SGDM, or than newer optimizers are over Adam, based on his own optimization experience.

Positive: 100.0% · Negative: 0.0% (1 comment with sentiment)

Cluster Engagement

Posts from X


Views: 1.4K · Retweets: 1

Ravid Shwartz Ziv @ziv_ravid

I love strong baselines 🤩

Dimitris Papailiopoulos @DimitrisPapail

It is looking very very good for Stochastic Grandpa

4h ago

Bookmarks: 3

Lucas Beyer (bl16) @giffmana

@DimitrisPapail Maybe you are too young but this was absolutely a thing :D https://arxiv.org/abs/1608.03983 https://arxiv.org/abs/1506.01186 https://arxiv.org/abs/2008.01171

9h ago
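For context, the first two links in the post above point to SGDR (cosine annealing with warm restarts, arXiv:1608.03983) and cyclical learning rates (arXiv:1506.01186). The thread does not say which shape "wave SGD" uses, so the following is only a minimal sketch of those two published schedule families. The function names, the fixed restart period, and the example values are assumptions made for illustration; the third link is not sketched here.

```python
import math


def cosine_warm_restarts(step, period, lr_max, lr_min=0.0):
    """Cosine annealing that restarts every `period` steps (SGDR-style).

    Assumption: a fixed restart period; SGDR also allows growing periods.
    """
    t_cur = step % period  # steps elapsed since the last restart
    cosine = 1.0 + math.cos(math.pi * t_cur / period)
    return lr_min + 0.5 * (lr_max - lr_min) * cosine


def triangular_cyclical(step, half_cycle, lr_base, lr_max):
    """Triangular cyclical learning rate: linear ramp up, then linear ramp down."""
    cycle = math.floor(1 + step / (2 * half_cycle))
    x = abs(step / half_cycle - 2 * cycle + 1)
    return lr_base + (lr_max - lr_base) * max(0.0, 1.0 - x)


# Hypothetical settings, chosen only to show the wave-like shape.
cosine_curve = [cosine_warm_restarts(s, period=10, lr_max=0.1) for s in range(30)]
triangle_curve = [triangular_cyclical(s, half_cycle=10, lr_base=0.01, lr_max=0.1) for s in range(40)]
```

Both curves rise and fall periodically, which is presumably why these papers were offered as precedent for a wave-like schedule.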

Likes: 10

Dimitris Papailiopoulos @DimitrisPapail

@giffmana heavy ball, so uses momentum. seems to be important after all

10h ago
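To make "heavy ball" concrete: the heavy-ball (Polyak momentum) update keeps a velocity v and applies v = beta * v - lr * grad, then w = w + v. Setting beta = 0 recovers plain SGD. The sketch below compares the two on a toy ill-conditioned quadratic. The problem, the learning rate, and beta = 0.9 are assumptions chosen for illustration, not the setup from the thread, and both runs share one learning rate rather than being tuned separately.

```python
def grad(w, curvatures):
    """Gradient of the toy loss 0.5 * sum(c_i * w_i^2)."""
    return [c * x for c, x in zip(curvatures, w)]


def loss(w, curvatures):
    return 0.5 * sum(c * x * x for c, x in zip(curvatures, w))


def run(lr, beta, steps, curvatures=(1.0, 100.0), w0=(1.0, 1.0)):
    """Deterministic gradient descent with heavy-ball momentum.

    beta = 0.0 gives plain gradient descent. No gradient noise is added,
    so this isolates the effect of momentum only.
    """
    w = list(w0)
    v = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w, curvatures)
        v = [beta * vi - lr * gi for vi, gi in zip(v, g)]  # velocity update
        w = [wi + vi for wi, vi in zip(w, v)]              # parameter update
    return loss(w, curvatures)


# Hypothetical hyperparameters for the toy problem.
plain_loss = run(lr=0.01, beta=0.0, steps=100)
heavy_ball_loss = run(lr=0.01, beta=0.9, steps=100)
print(f"plain: {plain_loss:.5f}  heavy ball: {heavy_ball_loss:.5f}")
```

On this toy problem the momentum run ends at a lower loss because the velocity keeps moving along the low-curvature direction, where plain gradient descent crawls.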

Replies: 2

Dimitris Papailiopoulos @DimitrisPapail

@giffmana it also uses a very weird schedule. calling it wave SGD for now :D

10h ago

Lucas Beyer (bl16) @giffmana

@DimitrisPapail Yep that's my experience, sgdm has always been a huge step on top of sgd, more than anything else; more than adam on sgdm or modern things on adam.

9h ago

Dimitris Papailiopoulos @DimitrisPapail

@giffmana I wish I was. I’m 42 🥲. I remember these papers used to work in OPT 2014-2022

9h ago

Dimitris Papailiopoulos @DimitrisPapail

@plugyawn @giffmana it does but it shows existence.

8h ago

Plugyawn @plugyawn

@DimitrisPapail @giffmana at some point the scheduler becomes an overfit risk, no?

9h ago

Plugyawn @plugyawn

@DimitrisPapail @giffmana i'm probably wrong but there probably exists an arbitrary scheduler that makes SGD or any such algorithm converge in arbitrarily small time?

8h ago

Plugyawn @plugyawn

@giffmana @DimitrisPapail Do you think it's just classic more parameters better at some level, is that not a call for dramatically more?

9h ago