I am starting a blog about deep learning theory and its value to practitioners! First post is about Adam, broken convergence proofs, and what theory can contribute when stuff just works anyways without it. Subscribe on Substack if you like it! https://undertheassumptions.substack.com/p/the-optimizer-that-outlived-its-proof
Sadhika Malladi launches a deep learning blog, showing how the Adam optimizer succeeded despite flawed convergence proofs
The analysis explores why empirical success precedes formal mathematical guarantees
Users praised Sadhika Malladi's new blog on Adam Optimizer for clearly highlighting gaps in deep learning convergence proofs and offering helpful distinctions in optimization settings.
A blog to follow! I learned a lot from @SadhikaMalladi's explorations and explanations, even when she was in the first year of her PhD.

Classical optimization theory usually considers convergence over a problem class, so the bound you get reflects the worst-case problem in that class for the algorithm. OCO and non-convex smooth optimization are two typical problem classes. The worst-case problem for an algorithm may be far from the optimization problems that arise when training NNs.
If we want to prove the advantage of a certain algorithm (why does a specific algorithm work for NNs?), we may need to look at more concrete problems.

Thanks, that distinction is helpful. My understanding is that there are several differences in setting beyond stochasticity: Reddi et al. consider an online convex/adversarial framework, while your paper studies an offline smooth non-convex setting with a fixed objective. This is also why, at the bottom of p. 2, your paper describes the two results as “incomparable.” That said, I see your point that the deterministic/full-batch result is relevant context for the broader Adam convergence literature, and I’m happy to add a brief citation.

Very nice post! I think another big gap in the literature or "theory" of optimization for DL is that most convergence proofs say nothing about constants, while all the real gains are in the constants. E.g., Muon being 20 percent faster than AdamW, or something else that cannot be gleaned from the asymptotic analysis.

@SadhikaMalladi I wish you had cited our work, which gave the *first* convergence proof for deterministic/full-batch Adam. That was a strong complement to the Reddi et al. work that happened in parallel.

@SadhikaMalladi please see

Thanks for reading and engaging with the post! It is focused on a quite narrow point and not meant to be a comprehensive survey. I highlighted Reddi et al. because their paper identified an issue with the original Adam convergence proof, and that episode drove the main narrative of the post around the value of deep learning theory. The community has since published many exciting results to improve our understanding of Adam, and maybe I will get around to writing more about that in a future post!

@SadhikaMalladi Our original point was that the issue identified by Reddi et al. is an effect of stochasticity and *not* an issue with Adam. We pointed out right then that the same standard Adam had no convergence issues when used in the full-batch setting.
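
For readers who want a concrete picture of what "full-batch" means here, the following is a minimal sketch of standard Adam run with the exact gradient of a fixed objective. The toy objective f(x) = x^2, the function name `adam_full_batch`, and the hyperparameter values are all illustrative assumptions, not taken from any of the papers discussed in this thread. A single toy run is not a convergence proof and says nothing about the online setting studied by Reddi et al.

```python
import math


def adam_full_batch(grad, x0, lr=0.01, beta1=0.9, beta2=0.999,
                    eps=1e-8, steps=2000):
    """Run standard Adam using the exact gradient of a fixed objective.

    `grad` is a hypothetical callable returning the full (noise-free)
    gradient at x; there is no minibatch sampling anywhere in the loop.
    """
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)                          # exact gradient, no sampling noise
        m = beta1 * m + (1 - beta1) * g      # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * g * g  # second-moment estimate
        m_hat = m / (1 - beta1 ** t)         # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)         # bias-corrected second moment
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy deterministic objective f(x) = x^2, whose gradient is 2x (an assumption
# chosen only for illustration).
x_final = adam_full_batch(lambda x: 2.0 * x, x0=1.0)
print(abs(x_final))  # distance from the minimizer x = 0
```

Replacing `grad` with a function that returns a different, sampled or adversarially chosen gradient at each step would move this sketch toward the online setting, which is where the two lines of work differ.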

@anujsapte Excellent point!

@SadhikaMalladi that's a correct summary :)