AMUSE optimizer merges Muon and schedule-free gradient evaluation to train models without learning rate decay
Lucas Nestler released a code implementation called SFMuon.
class SFMuon(C.ScheduleFree):
    def __init__(self, params, **kw):
        d = dict(lr=.02, beta=.95, weight_decay=0, warmup_steps=0, weight_lr_power=2., r=0.)
        d.update(kw)
        super().__init__(params, d, fns=(C.nesterov_ema, C.orthogonalize_update, C.update_by_schedule_free))
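The three functions chained in the SFMuon class (a Nesterov EMA, an orthogonalization step, a schedule-free update) suggest roughly the following per-step logic. This is a minimal pure-Python sketch under stated assumptions, not the AMUSE algorithm from the paper and not the library's internals: the helper names (`matmul`, `combine`, `orthogonalize`, `init_state`, `sf_muon_step`) and the momentum coefficient `mu` are hypothetical, the orthogonalization uses the classic cubic Newton-Schulz iteration rather than Muon's tuned quintic variant, and the averaging uses the uniform weights of plain schedule-free SGD (constant learning rate, no warmup).

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def combine(a, b, wa, wb):
    """Return wa * a + wb * b elementwise for two equally shaped matrices."""
    return [[wa * p + wb * q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def orthogonalize(g, steps=15, eps=1e-12):
    """Approximate the orthogonal polar factor of g.

    Assumption: classic cubic Newton-Schulz, X <- 1.5 X - 0.5 X X^T X,
    applied after Frobenius normalization so all singular values are <= 1.
    """
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = combine(x, xxtx, 1.5, -0.5)
    return x


def init_state(w):
    """Hypothetical optimizer state: x (averaged), z (base), y (gradient point)."""
    zeros = [[0.0] * len(row) for row in w]
    copy = lambda m: [row[:] for row in m]
    return {"x": copy(w), "z": copy(w), "y": copy(w), "m": zeros, "step": 0}


def sf_muon_step(state, grad, lr=0.02, beta=0.95, mu=0.95):
    """One update; grad must be the gradient evaluated at state["y"]."""
    state["step"] += 1
    # Nesterov-style EMA: update the momentum buffer, then blend it with grad again.
    state["m"] = combine(state["m"], grad, mu, 1.0 - mu)
    nesterov = combine(state["m"], grad, mu, 1.0 - mu)
    # Muon-style step direction: orthogonalize the momentum-smoothed gradient.
    update = orthogonalize(nesterov)
    # Schedule-free bookkeeping: z takes the raw step, x is a running mean of z.
    state["z"] = combine(state["z"], update, 1.0, -lr)
    c = 1.0 / (state["step"] + 1)  # uniform averaging weights (assumption)
    state["x"] = combine(state["x"], state["z"], 1.0 - c, c)
    # The next gradient is evaluated at an interpolation between z and x.
    state["y"] = combine(state["z"], state["x"], 1.0 - beta, beta)
    return state["y"]


# Usage: minimize 0.5 * ||W - target||^2, whose gradient at y is (y - target).
target = [[1.0, 0.0], [0.0, 1.0]]
state = init_state([[0.0, 0.0], [0.0, 0.0]])
for _ in range(10):
    grad = combine(state["y"], target, 1.0, -1.0)
    sf_muon_step(state, grad)
```

In this sketch the averaged iterate `x` is the one that would be used for evaluation at any point in training, which is where the "anytime" behavior without a decay schedule would come from; the gradient itself is always taken at `y`.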
🚨 New Optimizer Paper
AMUSE: Anytime MUon with Stable gradient Evaluation
AMUSE combines Muon with Schedule-Free-style gradient evaluation for stable anytime training without LR decay.
• Stronger 124M / 720M / 1B pretraining
• Strong ImageNet / ViT fine-tuning performance