
Now, why do lower-curvature models exhibit all those favorable properties? IMO, we still lack a completely satisfactory answer to that question. To say "flatter minima = better" is too simplistic; for example, when fine-tuning, people usually report that higher LRs are *bad*