
Microsoft Research's Dimitris Papailiopoulos questions whether training optimizers like SGD and Muon create distinct model behaviors at equivalent validation loss

Path dependency can cause varying overconfidence on misclassified data.
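To make "equivalent validation loss, different behavior" concrete, here is a minimal sketch with two hypothetical binary classifiers. The probabilities are hand-picked for illustration and do not come from the discussion or from any paper; no optimizer is simulated. Both models get the same three examples right and the same one wrong, and their mean cross-entropy is matched by construction, yet one is noticeably more confident in its wrong answer.

```python
import math

def mean_cross_entropy(true_class_probs):
    """Mean negative log-likelihood of the true class over a validation set."""
    return -sum(math.log(p) for p in true_class_probs) / len(true_class_probs)

# Hypothetical model A: moderately confident everywhere.
# Each entry is the probability assigned to the TRUE class of one example;
# a value below 0.5 means the (binary) example is misclassified.
p_correct_a, p_wrong_a = 0.8, 0.4
model_a = [p_correct_a] * 3 + [p_wrong_a]

# Hypothetical model B: puts only 0.25 on the true class of the misclassified
# example. Solve for its probability on the correct examples so that the
# product of true-class probabilities (and hence the mean loss) matches A.
p_wrong_b = 0.25
p_correct_b = (p_correct_a ** 3 * p_wrong_a / p_wrong_b) ** (1 / 3)
model_b = [p_correct_b] * 3 + [p_wrong_b]

loss_a = mean_cross_entropy(model_a)
loss_b = mean_cross_entropy(model_b)

# In a binary problem, confidence in the wrong label is 1 - P(true class).
wrong_conf_a = 1 - p_wrong_a
wrong_conf_b = 1 - p_wrong_b

print(f"loss A = {loss_a:.6f}, loss B = {loss_b:.6f}")
print(f"confidence in wrong label: A = {wrong_conf_a:.2f}, B = {wrong_conf_b:.2f}")
```

Validation loss alone cannot tell these two apart; separating them takes a per-example look at confidence on the errors, or a calibration metric.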



Original post

Dimitris Papailiopoulos@DimitrisPapail

I have a silly shower thought:

Say you train machine god and you reach some Val loss you’re happy with. But there are multiple models with the same loss.

Does it have different “personality traits” depending on the optimizer? Are aspects of the optimizer path affecting final model behavior?

Is machine god trained with SGD perhaps more chill than with Muon?

Perhaps!

9:18 PM · Jun 12, 2026 · 4.2K Views

Sentiment

Some users praise a paper on whether optimizer choice shapes AI model behavior, calling it one of their favorites and very cool.

Positive: 100.0% · Negative: 0.0% (2 comments with sentiment)

Posts from X


Susan Zhang@suchenzang

@DimitrisPapail 👀

Susan Zhang@suchenzang

been a while since i've seen such a well-articulated paper highlighting train-test gap, path-dependency of training dynamics on convergence, and more

it would be a funny stretch if a "better optimizer" now leads to "overconfidence on misclassified test examples", aka brittle sycophancy we now see in many frontier models... 👀


Replies (2)

ueaj@_ueaj

@DimitrisPapail one of my favorite papers of all time https://arxiv.org/abs/2507.12224


Wenhao Chai@wenhaocha1

@DimitrisPapail yes https://jiaxin-wen.github.io/blog/generalization-dynamics.html


Dimitris Papailiopoulos@DimitrisPapail

@_ueaj Huh!


Tarun Kathuria@_TarunKathuria

Yes, this seems to happen a lot, but I am not sure how to characterize it. Also, if you use one optimizer such as Adam for pretraining and then switch to something else, like a matrix-aware optimizer, performance can be poor compared to sticking with Adam. Personally, it's a problem I am very interested in understanding how to characterize, or at least in developing a unified recipe for.


Dimitris Papailiopoulos@DimitrisPapail

@_TarunKathuria Ha, very cool!
