Cameron R. Wolfe argues that training optimizers such as SGD and Muon shape model behavior even when validation loss is identical · Digg