btw this is a weird heavy-ball SGD variant that basically does this:
1. Load previous checkpoint weights.
2. Reset optimizer state / momentum buffers.
3. Train N steps.
4. For first M steps: warm LR from 0.1x -> 1.0x.
5. Hold LR flat until ~50% of the cycle.
6. Linearly decay LR to zero.
7. Save checkpoint.
8. Repeat.
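rough sketch of the per-cycle LR multiplier from steps 4-6, in case it helps. Assumptions that are mine, not from the run: the warmup is linear, the hold ends at exactly 50% of the cycle, the decay hits zero exactly at step N, and the names `cycle_lr_scale`, `n_steps`, `warmup_steps`, `floor`, `hold_frac` are made up for illustration.

```python
def cycle_lr_scale(step, n_steps, warmup_steps, floor=0.1, hold_frac=0.5):
    """LR multiplier for one cycle: warm up, hold flat, then decay linearly to zero.

    Assumed shape (not confirmed by the thread): linear warmup from `floor`
    to 1.0 over `warmup_steps`, flat until `hold_frac` of the cycle, then
    linear decay that reaches 0.0 at `n_steps`.
    """
    hold_end = int(hold_frac * n_steps)
    if step < warmup_steps:
        # Warmup: floor -> 1.0
        return floor + (1.0 - floor) * step / warmup_steps
    if step < hold_end:
        # Flat phase at the peak LR
        return 1.0
    # Linear decay: 1.0 at hold_end, 0.0 at n_steps
    return max(0.0, (n_steps - step) / (n_steps - hold_end))


if __name__ == "__main__":
    # One cycle with N = 1000 and M = 100 (illustrative values only)
    for s in (0, 50, 100, 499, 500, 750, 1000):
        print(s, cycle_lr_scale(s, n_steps=1000, warmup_steps=100))
```

the actual LR at a step would be the peak LR times this multiplier, and the step counter goes back to 0 at the start of every cycle (along with the momentum buffers).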
the optimizer is exactly this:
buf = mu * buf + grad
p *= 1 - lr * wd
p -= lr * buf / 524288
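same three lines as a runnable toy, on plain Python lists so nothing needs installing. This is only a sketch: `heavy_ball_step` and its argument names are hypothetical, and 524288 (which is 2**19) is kept as an opaque constant because the thread doesn't say what it stands for.

```python
SCALE = 524288  # 2**19; meaning not stated in the thread, kept as-is


def heavy_ball_step(params, grads, bufs, lr, mu, wd, scale=SCALE):
    """One in-place update of heavy-ball SGD with decoupled weight decay.

    params, grads, bufs are equal-length lists of floats (a stand-in for
    tensors). bufs holds the momentum buffers and is updated in place.
    """
    for i in range(len(params)):
        bufs[i] = mu * bufs[i] + grads[i]     # buf = mu * buf + grad
        params[i] *= 1 - lr * wd              # p *= 1 - lr * wd
        params[i] -= lr * bufs[i] / scale     # p -= lr * buf / 524288


if __name__ == "__main__":
    p, buf = [1.0], [0.0]                     # fresh buffers, as after a reset
    g = [float(SCALE)]                        # toy gradient, equals 1.0 after scaling
    heavy_ball_step(p, g, buf, lr=0.1, mu=0.9, wd=0.0)
    heavy_ball_step(p, g, buf, lr=0.1, mu=0.9, wd=0.0)
    print(p, buf)
```

note the weight decay multiplies the weights directly and never goes through the momentum buffer, so it's the decoupled kind rather than an L2 term added to the gradient.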
@CevherLIONS @_arohan_ does this have a name?
I'd call it wave SGD lol
BOOM shakalaka
now shortening the timeline :D
