Momentum accelerates training, but are the savings in compute or in serial runtime? New work: we prove compute-efficiency (CE: total compute to reach a target loss, not serial steps) lower bounds for stochastic Heavy Ball (HB) and Accelerated SGD (ASGD) [Kidambi et al., 2018].
Researchers Release Preprint Analyzing SGD and ASGD Efficiency Across Batch Sizes
Users praise the new work proving compute-efficiency lower bounds for stochastic momentum methods because of its amazing collaborators.
For linear regression on Gaussian covariates, we show HB improves serial runtime over SGD, but it does not improve the CE frontier. Takeaway: HB raises the critical batch size (you can use bigger batches to cut serial steps), but it needs the same compute as SGD to hit a target loss.
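To make the compute vs. serial-runtime distinction concrete, here is a minimal sketch of stochastic Heavy Ball on linear regression with Gaussian covariates. It is not the setup from the preprint: the function name `run_heavy_ball`, the identity covariance, the noiseless labels, the all-ones target weights, and every hyperparameter are assumptions chosen for illustration. Serial runtime is counted as the number of steps, and compute as steps times batch size.

```python
import random


def run_heavy_ball(batch_size, lr, momentum, target_loss,
                   dim=5, max_steps=20000, seed=0):
    """Stochastic Heavy Ball on noiseless linear regression.

    Covariates are standard Gaussian (identity covariance, an assumption).
    Returns (serial_steps, compute) once the population loss reaches
    target_loss, or None if max_steps is exhausted first.
    Setting momentum=0.0 recovers plain mini-batch SGD.
    """
    rng = random.Random(seed)
    w_star = [1.0] * dim      # hypothetical ground-truth weights
    w = [0.0] * dim
    velocity = [0.0] * dim

    for step in range(1, max_steps + 1):
        # Mini-batch gradient of 0.5 * (x . w - x . w_star)^2
        grad = [0.0] * dim
        for _ in range(batch_size):
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            residual = sum(xi * (wi - si) for xi, wi, si in zip(x, w, w_star))
            for j in range(dim):
                grad[j] += residual * x[j] / batch_size

        # Heavy Ball update: velocity accumulates past gradients
        velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
        w = [wi + vi for wi, vi in zip(w, velocity)]

        # With identity covariance the population loss is 0.5 * ||w - w_star||^2
        loss = 0.5 * sum((wi - si) ** 2 for wi, si in zip(w, w_star))
        if loss <= target_loss:
            return step, step * batch_size   # (serial runtime, compute)

    return None


if __name__ == "__main__":
    for b in (1, 4, 16):
        print("batch", b,
              "SGD:", run_heavy_ball(b, 0.05, 0.0, 1e-3),
              "HB:", run_heavy_ball(b, 0.05, 0.5, 1e-3))
```

Sweeping `batch_size` and `momentum` and recording both numbers in the returned pair is one way to look at the two axes separately: the first entry tracks serial steps, the second tracks total samples processed. A toy run like this only illustrates the bookkeeping; it says nothing about the lower bounds themselves.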
Work done with amazing collaborators: @depen_morwani, @alexmeterez, @pranavn1008. Preprint: https://arxiv.org/abs/2606.19179
We extend the analysis to ASGD, a momentum variant with an extra buffer, which achieves a better serial runtime at batch size 1. For power-law spectra with fast-decaying exponents, ASGD improves small-batch CE, but as batch size grows, it trades that advantage for better serial runtime.
Tagging some folks: @QuanquanGu, @jainprateek_, @PNetrapalli, @aaron_defazio, @BachFrancis, @konstmish, @ddrusvyat, @KempnerInst, @_arohan_, @jxbz