/AI10h ago

Dimitri Bertsekas, Optimization and RL Pioneer, Passes Away

415414486.6K

#1397

Original post

Fabian Pedregosa#1397

Mark Schmidt@MarkSchmidtUBC

I am sad to hear of the passing of Dimitri Bertsekas. This one hurts.

Dimitri had a big effect on my career, from inspiring research topics to writing one of my tenure letters.

A long thread on memories of Bertsekas and some of his works that influenced me the most.

11:00 PM · Jun 5, 2026 · 6.6K Views

Sentiment

Many users fondly recalled Dimitri Bertsekas' elegant optimization methods, prize-winning papers, and insightful books as enduring resources that shaped their research and careers.

Pos

100.0%

Neg

0.0%

21 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

VIEWS295BOOKMARKS1

Mark Schmidt@MarkSchmidtUBC

One of my favourite algorithms of Bersekas is the two-metric projection method. This lets you apply Newton-style methods to problems with upper and lower variable bounds. I think this is the most elegant way to generalize unconstrained Newton-style methods to this setting.

10h29521

LIKES3

Mark Schmidt@MarkSchmidtUBC

I used this approach in the first year of my PhD, after formulating L1-regularization problems as bound-constrained problems. Later, the first contribution of my thesis was extending the two-metric projection framework to directly address L1-regularization problems.

10h2513

REPLIES1

Mark Schmidt@MarkSchmidtUBC

One of Dimitri's major legacies will be his books, which are outstanding and on a whole host of topics. I have recommended a variety of books to various students over the years, and this will definitely continue.

10h822

Mark Schmidt@MarkSchmidtUBC

Later during my postdoc time, I found that Bertsekas' Neurodynamic Programming book was one of the best introductions to optimization with inexact gradients or stochastic gradients. Still a great reference, covering ideas that are lost in the focus of the current literature.

10h1272

Mark Schmidt@MarkSchmidtUBC

Bertsekas' books are a treasure trove of insights that people keep rediscovering. In his NLP book, the first constrained optimization algorithms he covers is Frank-Wolfe ("conditional gradient"), more than a decade before its modern resurgence (several decades pre-Muon).

10h1242

Mark Schmidt@MarkSchmidtUBC

Indeed, in UBC's Machine Learning Reading Group one of the next books we are likely to read is his book "Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control". When I visited ASU he that rollout is implementing Newton's method.

10h1162

Mark Schmidt@MarkSchmidtUBC

An even more prophetic example is his book with John Tsitsiklis on Parallel and Distribution Computation from way back in 1989. This book is outstanding and well before its time (it was before Bertsekas started "publishing his books out of his garage").

10h1112

Mark Schmidt@MarkSchmidtUBC

After reading the SAG paper, Dimitri said he went to my webpage and liked my style of work. He then arranged to fly to Vancouver to spend a day learning everything he could about optimization for machine learning from me for his new book.

10h1072

Mark Schmidt@MarkSchmidtUBC

Since the book is hard to get ahold of, this led to a huge number of re-inventions of ideas already in the book. From my memory, it already covered ideas like parallel coordinate descent, decomposition methods, and ADMM-like methods.

10h1042

Mark Schmidt@MarkSchmidtUBC

I first heard about the IAG method in a paper by Bertsekas discussing why the method was interesting. This link led to our SAG method which won the Lagrange Prize, and arguably initiated the field of fast variance-reduced stochastic gradient methods like SVRG and its successors.

10h1032

Mark Schmidt@MarkSchmidtUBC

When working on "active-set complexity". We were having a tough time slogging through the modern manifold identification papers, but we found that the first paper on this topic was a paper by Bertsekas in the 70s.

10h972

Mark Schmidt@MarkSchmidtUBC

Dimitri later became a letter-writer for me, which helped me obtaining various grants/awards as well as tenure. He even kept doing this after his retirement.

10h952

Mark Schmidt@MarkSchmidtUBC

This was once again a topic where a work by Bertsekas pointed to the the first references I encountered and had got thinking about the topic (to some neural network papers from the 90s).

10h932

Mark Schmidt@MarkSchmidtUBC

In 2023 I visited him at his ASU office, where he became a faculty member after deciding he didn't like retirement. I gave a talk on optimization for over-parameterized models.

10h932

Mark Schmidt@MarkSchmidtUBC

During my visit to ASU, we discussed various things including a story Dimitri told of visiting Polyak in Russia. Even though he will be best remembered at MIT, he seemed really happy at ASU and he said that he felt that he was having the most productive time of his career.

10h852

Mark Schmidt@MarkSchmidtUBC

My last interaction with Dimitri went back to the beginning, where I began a long e-mail thread with him about two-metric projection methods.

10h802

Mark Schmidt@MarkSchmidtUBC

On the other hand, Dimitri thought the over-parameterization assumption was too strong, and in the long run he is right (LLMs do not satisfy the type of over-parameterization we were studying).

10h802

Mark Schmidt@MarkSchmidtUBC

We were trying to implement his adaptation of these methods to optimize over the probability simplex, and for once his book had some incorrect information. It was an easy fix, and this reminds that I should bug one of my students to write this up: it is a really nice algorithm!

10h1111

Mark Schmidt@MarkSchmidtUBC

He just wanted to meet and didn't want to give a talk or meet with anyone, but I ultimately asked him to give a talk which packed an auditorium. It certainly reflected well on me as a first-year faculty member that he spoke highly of the SAG work in this talk.

10h981

Mark Schmidt@MarkSchmidtUBC

It gave such an elegant and simplified presentation that it became obvious within a couple days how we could derive non-asymptotic bounds on the time needed for algorithms to recover sparsity patterns.

10h921