13h ago

AMUSE optimizer merges Muon and schedule-free gradient evaluation to train models without learning rate decay

Lucas Nestler released a code implementation called SFMuon.


Cool + Prof. Chulhee was on my defense committee and he is really kind

12:30 AM · May 26, 2026

    class SFMuon(C.ScheduleFree):
        def __init__(self, params, **kw):
            d = dict(lr=0.02, beta=0.95, weight_decay=0, warmup_steps=0,
                     weight_lr_power=2.0, r=0.0)
            d.update(kw)
            super().__init__(
                params, d,
                fns=(C.nesterov_ema, C.orthogonalize_update, C.update_by_schedule_free),
            )
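To make the idea concrete, here is a minimal pure-Python sketch of one optimizer step that follows the same three-stage order as the snippet above (Nesterov-style EMA, orthogonalization, schedule-free averaging). It is an illustration under assumptions, not the AMUSE paper's algorithm and not the SFMuon code. The helper names (`matmul`, `combine`, `orthogonalize`, `sf_muon_step`) are hypothetical, a single `beta` is reused for both the momentum and the interpolation, averaging is uniform (`c = 1/(t+1)`), warmup and weight decay are left out, and the classic cubic Newton-Schulz iteration stands in for Muon's tuned quintic variant.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def combine(a, b, wa, wb):
    """Elementwise wa * a + wb * b for two equally shaped matrices."""
    return [[wa * p + wb * q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of g.

    Uses the classic cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X
    (a stand-in for Muon's tuned quintic; assumption for this sketch).
    """
    # Dividing by the Frobenius norm keeps every singular value at or below 1,
    # which is inside the convergence region of the iteration.
    norm = math.sqrt(sum(v * v for row in g for v in row)) + 1e-12
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xt = [list(col) for col in zip(*x)]
        x = combine(x, matmul(matmul(x, xt), x), 1.5, -0.5)
    return x

def sf_muon_step(x, z, m, grad_fn, t, lr=0.02, beta=0.95):
    """One hypothetical schedule-free step with an orthogonalized update.

    x: averaged iterate (the weights you would evaluate or deploy)
    z: base iterate that takes the raw steps
    m: EMA of past gradients
    """
    y = combine(x, z, beta, 1.0 - beta)   # gradient is evaluated at y, not x or z
    g = grad_fn(y)
    m = combine(m, g, beta, 1.0 - beta)   # EMA of gradients
    u = combine(m, g, beta, 1.0 - beta)   # Nesterov-style look-ahead
    o = orthogonalize(u)                  # Muon-style orthogonalized direction
    z = combine(z, o, 1.0, -lr)           # base iterate takes a constant-lr step
    c = 1.0 / (t + 1)                     # uniform averaging weight (assumption)
    x = combine(x, z, 1.0 - c, c)         # running average replaces LR decay
    return x, z, m

# Toy problem: f(W) = 0.5 * ||W - target||^2, so grad f(W) = W - target.
target = [[3.0, 1.0], [1.0, 2.0]]
grad_fn = lambda w: combine(w, target, 1.0, -1.0)
zeros = [[0.0, 0.0], [0.0, 0.0]]
x, z, m = sf_muon_step(zeros, zeros, zeros, grad_fn, t=0)
```

Because the averaged iterate `x` is refreshed at every step, it can be read out at any point during training, which is the sense in which schedule-free methods do without a decay schedule.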

Jueun Kim (@jueunkim_0525)

🚨 New Optimizer Paper
AMUSE: Anytime MUon with Stable gradient Evaluation
AMUSE combines Muon with Schedule-Free-style gradient evaluation for stable anytime training without LR decay.
• Stronger 124M / 720M / 1B pretraining
• Strong ImageNet / ViT fine-tuning performance

4:20 AM · May 26, 2026 · 28.3K Views
8:19 AM · May 26, 2026 · 7.8K Views