Cameron R. Wolfe speculates that optimizers like SGD and Muon may shape model behavior even at identical validation loss

Original post

@DimitrisPapail SGD would definitely be less chill. Muon would be the gen z machine god.

I have a silly shower thought:

Say you train machine god and you reach some val loss you’re happy with. But there are multiple models with the same loss.

Does it have different “personality traits” depending on the optimizer? Are aspects of the optimizer path affecting final model behavior?

Is machine god trained with SGD perhaps more chill than with Muon?

Perhaps!

7:33 AM · Jun 13, 2026