
Sadhika Malladi launches a deep learning blog, showing how the Adam optimizer succeeded despite flawed convergence proofs

The analysis explores why empirical success precedes formal mathematical guarantees


Original post

Sadhika Malladi@SadhikaMalladi

I am starting a blog about deep learning theory and its value to practitioners! First post is about Adam, broken convergence proofs, and what theory can contribute when stuff just works anyways without it. Subscribe on Substack if you like it! https://undertheassumptions.substack.com/p/the-optimizer-that-outlived-its-proof

9:57 AM · Jun 30, 2026 · 65.1K Views

Sentiment

Users praised Sadhika Malladi's new blog post on the Adam optimizer for clearly highlighting gaps in deep learning convergence proofs and for drawing helpful distinctions between optimization settings.

Positive: 100.0% · Negative: 0.0%

5 comments with sentiment.


Related links

The optimizer that outlived its proof

substack.com

Posts from X


Sanjeev Arora@prfsanjeevarora

A blog to follow! I learned a lot from @SadhikaMalladi's explorations and explanations, even when she was in the first year of her PhD.


Shenyang Deng ✈️ ICML2026@DengShenyang24

Classical optimization theory usually considers convergence over a problem class, so the bound you get is the result on the worst-case problem for the algorithm. OCO and non-convex smooth optimization are two typical problem classes. The worst-case problem for an algorithm may be far from the optimization problems that arise in NNs.

If we want to prove the advantage of a certain algorithm (why do some algorithms work specifically for NNs?), we may need to look at more concrete problems.


Sadhika Malladi@SadhikaMalladi

Thanks, that distinction is helpful. My understanding is that there are several differences in setting beyond stochasticity: Reddi et al. consider an online convex/adversarial framework, while your paper studies an offline smooth non-convex setting with a fixed objective. This is also why, at the bottom of p. 2, your paper describes the two results as “incomparable.” That said, I see your point that the deterministic/full-batch result is relevant context for the broader Adam convergence literature, and I’m happy to add a brief citation.


Anuj Apte@anujsapte

Very nice post! I think another big gap in the literature or "theory" of optimization for DL is that most convergence proofs say nothing about constants, while all the real gains are in the constants. E.g., Muon is 20 percent faster than AdamW, or something else that cannot be gleaned from the asymptotic analysis.

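The point about constants in the comment above can be made concrete with a small sketch. It assumes two hypothetical methods, A and B, whose error bounds both have the form `constant / sqrt(T)`; the constants `C_A` and `C_B` are invented for illustration, and the 20 percent figure is the commenter's own example, not a measured result. Both methods need on the order of `1 / eps**2` steps, so an asymptotic statement cannot tell them apart, yet B reaches every target in about 80 percent of the steps.

```python
import math

def steps_to_reach(eps, constant):
    """Steps T needed so that constant / sqrt(T) <= eps (up to float rounding)."""
    return math.ceil((constant / eps) ** 2)

C_A = 1.0                    # hypothetical constant for method A
C_B = C_A * math.sqrt(0.8)   # hypothetical constant chosen so B needs ~20% fewer steps

for eps in (1e-1, 1e-2, 1e-3):
    t_a = steps_to_reach(eps, C_A)
    t_b = steps_to_reach(eps, C_B)
    # Both counts grow like 1 / eps**2; only the constant factor differs.
    print(f"eps={eps:g}  T_A={t_a}  T_B={t_b}  T_B/T_A={t_b / t_a:.3f}")
```

The ratio stays near 0.8 at every target accuracy, which is exactly the kind of gain a rate stated only up to constants hides.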

Anirbit@anirbit_maths

@SadhikaMalladi I wish you cited our work, which gave the *first* convergence proof for deterministic/full-batch Adam. That was a strong complement to the Reddi et al. work that happened in parallel.


deep Manifold@BetaTomorrow

@SadhikaMalladi please see


Sadhika Malladi@SadhikaMalladi

Thanks for reading and engaging with the post! It is focused on a quite narrow point and not meant to be a comprehensive survey. I highlighted Reddi et al. because their paper identified an issue with the original Adam convergence proof, and that episode drove the main narrative of the post around the value of deep learning theory. The community has since published many exciting results to improve our understanding of Adam, and maybe I will get around to writing more about that in a future post!


Anirbit@anirbit_maths

@SadhikaMalladi Our original point was that the issue identified by Reddi et al. is an effect of stochasticity and *not* an issue with Adam. We pointed out right then that the same standard Adam had no convergence issues when used in the full-batch setting.


Sadhika Malladi@SadhikaMalladi

@anujsapte Excellent point!


Anirbit@anirbit_maths

@SadhikaMalladi that's a correct summary :)

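The difference in setting discussed in this thread, a sequence of changing per-round losses versus one fixed full-batch objective, can be illustrated with a toy sketch. Everything here is an assumption made for illustration rather than something taken from the post or from either paper: a scalar `x` constrained to `[-1, 1]`, an Adam-style update with `beta1 = 0` and no bias correction, a step size of `lr / sqrt(t)`, and a repeating cycle of linear losses `C*x, -x, -x` in the spirit of the online constructions in that literature. The averaged loss over a cycle is `(C - 2) * x / 3`, which for `C = 3` is minimized at `x = -1`.

```python
import math

def run_adam(grad_at_step, steps=3000, lr=0.1, beta2=0.1, eps=1e-8):
    """Adam-style update with beta1 = 0 and no bias correction, x kept in [-1, 1]."""
    x, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_at_step(t)
        v = beta2 * v + (1.0 - beta2) * g * g      # second-moment estimate
        x -= (lr / math.sqrt(t)) * g / (math.sqrt(v) + eps)
        x = max(-1.0, min(1.0, x))                 # project back onto [-1, 1]
    return x

C = 3.0  # hypothetical constant; the cycle-averaged loss is (C - 2) * x / 3

def online_grad(t):
    # Per-round gradients: loss C*x on rounds 1, 4, 7, ..., loss -x otherwise.
    return C if t % 3 == 1 else -1.0

def full_batch_grad(t):
    # Gradient of the cycle-averaged loss, identical at every step.
    return (C - 2.0) / 3.0

x_online = run_adam(online_grad)
x_full = run_adam(full_batch_grad)
print(f"per-round gradients: x = {x_online:+.3f}")
print(f"full-batch gradient: x = {x_full:+.3f}")
```

In this toy setup the large gradient is quickly forgotten by the second-moment estimate, so the small gradients that follow take comparatively large steps, and the per-round run drifts to the wrong end of the interval while the full-batch run reaches the minimizer. It is a sketch of why the two settings can behave differently and does not stand in for either paper's result.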