@DimitrisPapail Is this with momentum, or pure SGD?
It is looking very very good for Stochastic Grandpa
The baseline trails modern optimizers by 10 compute hours.
Users praise SGDM as a major step above plain SGD and alternatives like Adam based on their optimization experience.
I love strong baselines 🤩

@DimitrisPapail Maybe you are too young but this was absolutely a thing :D https://arxiv.org/abs/1608.03983 https://arxiv.org/abs/1506.01186 https://arxiv.org/abs/2008.01171

@giffmana heavy ball, so uses momentum. seems to be important after all
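For anyone following along, a minimal sketch of what "heavy ball" means next to pure SGD. The toy objective, step count, learning rate, and `beta` value are illustrative assumptions, not the settings from the run discussed here.

```python
def sgd_step(w, grad, lr):
    """Pure SGD: step against the current gradient only."""
    return w - lr * grad


def heavy_ball_step(w, v, grad, lr, beta=0.9):
    """Heavy-ball (Polyak) momentum: v accumulates past gradients."""
    v = beta * v + grad
    return w - lr * v, v


def minimize(use_momentum, steps=100, lr=0.01):
    # Hypothetical toy problem: f(w) = 0.5 * w**2, so the gradient is w.
    w, v = 1.0, 0.0
    for _ in range(steps):
        grad = w
        if use_momentum:
            w, v = heavy_ball_step(w, v, grad, lr)
        else:
            w = sgd_step(w, grad, lr)
    return w


plain = minimize(use_momentum=False)
momentum = minimize(use_momentum=True)
print(abs(plain), abs(momentum))
```

With the same small learning rate, the momentum run ends much closer to the minimum at 0 on this toy problem. That says nothing by itself about real training runs.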

@giffmana it also uses a very weird schedule. calling it wave SGD for now :D
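The actual "wave" schedule is not spelled out in this thread. As a stand-in, here is a sketch of the cosine schedule with warm restarts from the SGDR paper linked above (arXiv 1608.03983), using fixed-length cycles. The function name and the `period`, `lr_max`, and `lr_min` values are assumptions for illustration.

```python
import math


def cosine_restart_lr(step, period=100, lr_max=0.1, lr_min=0.001):
    """Cosine annealing with warm restarts, using fixed-length cycles."""
    t_cur = step % period  # position inside the current cycle
    cos_term = 1 + math.cos(math.pi * t_cur / period)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


# The learning rate decays within each cycle, then jumps back up at the restart.
for step in (0, 50, 99, 100):
    print(step, cosine_restart_lr(step))
```

The jump back to `lr_max` at each restart is what gives the schedule its wave-like shape.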

@DimitrisPapail Yep, that's my experience: SGDM has always been a huge step up from SGD, more than anything else; more than Adam over SGDM, or modern things over Adam.

@giffmana I wish I was. I’m 42 🥲. I remember these papers; I used to work in OPT 2014-2022.

@plugyawn @giffmana it does but it shows existence.

@DimitrisPapail @giffmana at some point the scheduler becomes an overfit risk, no?

@DimitrisPapail @giffmana i'm probably wrong but there probably exists an arbitrary scheduler that makes SGD or any such algorithm converge in arbitrarily small time?

@giffmana @DimitrisPapail Do you think it's just the classic "more parameters is better" at some level? Is that not a call for dramatically more?