Really great explanation of the low and even negative cosine similarity between momentum and gradient, something I’ve been wondering about for a while.
Reminds me of the “river and valley interpretation”.
It also raises the question of whether we could use dynamic beta coefficients or dampening, adjusted based on a tracked rate of sign flips or variance.
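
To make the idea concrete, here is a rough NumPy sketch of what I have in mind. Everything in it is hypothetical: the names `sign_flip_rate`, `dynamic_beta`, and `sgd_dynamic_momentum`, the `beta_min`/`beta_max` range, and especially the mapping itself. I assume here that more sign flips should push beta up (more averaging), but the opposite choice could be argued just as well. It is not taken from any existing optimizer.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    # Cosine similarity between two vectors; eps avoids division by zero.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def sign_flip_rate(prev_grad, grad):
    # Fraction of coordinates whose gradient sign changed since the last step.
    return float(np.mean(np.sign(prev_grad) * np.sign(grad) < 0))

def dynamic_beta(flip_ema, beta_min=0.8, beta_max=0.99):
    # ASSUMPTION: linearly interpolate beta from the tracked flip rate in [0, 1].
    return beta_min + (beta_max - beta_min) * flip_ema

def sgd_dynamic_momentum(grad_fn, x0, lr=0.01, steps=200, flip_decay=0.9):
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)        # momentum buffer (EMA of gradients)
    prev_g = np.zeros_like(x)   # previous gradient, for sign comparison
    flip_ema = 0.0              # smoothed sign-flip rate
    history = []                # (cos(momentum, gradient), beta) per step
    for _ in range(steps):
        g = grad_fn(x)
        flip_ema = flip_decay * flip_ema + (1 - flip_decay) * sign_flip_rate(prev_g, g)
        beta = dynamic_beta(flip_ema)
        # Record the similarity before the buffer absorbs the new gradient.
        history.append((cosine_similarity(m, g), beta))
        m = beta * m + (1 - beta) * g
        x = x - lr * m
        prev_g = g
    return x, history
```

Something like `sgd_dynamic_momentum(lambda x: np.array([100.0, 1.0]) * x, [1.0, 1.0])` on an ill-conditioned quadratic would be a cheap way to watch how the logged cosine similarity and beta move together. A variance-based signal could replace `sign_flip_rate` without changing the rest.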