I have a silly shower thought:
Say you train machine god and you reach some val loss you’re happy with. But there are multiple models with that same loss.
Does it have different “personality traits” depending on the optimizer? Do aspects of the optimization path affect the final model’s behavior?
Is machine god trained with SGD perhaps more chill than with Muon?
Perhaps!
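
For what it's worth, a toy version of this is easy to poke at. The sketch below is not SGD vs Muon and definitely not machine god: it assumes a tiny overparameterized linear regression (more weights than data points), and compares plain gradient descent against gradient descent with a fixed diagonal preconditioner, which is only a rough stand-in for "a different optimizer". All names here (`train`, `precond`, `w_gd`, `w_pre`) are made up for illustration. Both runs start from zero and drive the training loss to essentially zero, but they land on different weights, so they disagree on inputs they have never seen.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                      # 5 data points, 20 weights: many zero-loss solutions
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def loss(w):
    """Mean squared error on the training set (halved)."""
    return 0.5 * np.mean((X @ w - y) ** 2)

def train(precond, steps=20000, lr=0.01):
    """Gradient descent from zero; `precond` rescales each coordinate's step."""
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n
        w -= lr * precond * grad
    return w

w_gd = train(np.ones(d))                   # plain gradient descent
w_pre = train(np.linspace(0.2, 2.0, d))    # toy "other optimizer" (assumption)

x_new = rng.normal(size=d)                 # an input neither model was trained on
print("loss (plain GD):       ", loss(w_gd))
print("loss (preconditioned): ", loss(w_pre))
print("distance between weights:", np.linalg.norm(w_gd - w_pre))
print("predictions on new input:", x_new @ w_gd, x_new @ w_pre)
```

In this linear setting, plain gradient descent from zero ends up at the minimum-norm solution, while the preconditioned run prefers a differently weighted one. Same loss, different "personality", at least in a toy. Whether anything like this survives at machine god scale is exactly the open question.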



