Tech · 34d ago

AMUSE optimizer merges Muon and schedule-free gradient evaluation to train models without learning rate decay

Lucas Nestler released a code implementation called SFMuon.


Original post

Simo Ryu @cloneofsimo

Cool + Prof. Chulhee was my defense's committee and he is really kind

Jueun Kim @jueunkim_0525

🚨New Optimizer Paper AMUSE: Anytime MUon with Stable gradient Evaluation

AMUSE combines Muon with Schedule-Free-style gradient evaluation for stable anytime training without LR decay.

• Stronger 124M / 720M / 1B pretraining
• Strong ImageNet / ViT fine-tuning performance.

12:30 AM · May 26, 2026 · 5.5K Views

Sentiment

Commenters respond positively to the AMUSE optimizer paper: one calls Prof. Chulhee a solid pick for the committee, and another calls the paper's name clever.

Positive: 100.0% · Negative: 0.0%

2 comments with sentiment.


Posts from X

Most Activity

Views 9.4K · Bookmarks 59 · Likes 81 · Retweets 11 · Replies 2

Lucas Nestler @Clashluke

class SFMuon(C.ScheduleFree):
    def __init__(self, params, **kw):
        d = dict(lr=.02, beta=.95, weight_decay=0, warmup_steps=0,
                 weight_lr_power=2., r=0.)
        d.update(kw)
        super().__init__(params, d, fns=(C.nesterov_ema, C.orthogonalize_update, C.update_by_schedule_free))


Alex UGift @Radipdegen

@cloneofsimo fancy having him on your committee mustve felt good lol

paper name is clever too ngl


Strata @ChainZenit

@cloneofsimo prof chulhee sounds like a solid pick


Lucas Nestler @Clashluke

This implements the fixed-beta version which performs similarly to the hand-tuned-schedule variant.

The full implementation is:

_raw_sf = C.update_by_schedule_free.fn.fn

def _amuse_beta(group):
    step = group.get('_group_step')
    if step is None:
        return group['amuse_beta1']
    step, w = max(int(step), 1), group['warmup_steps']
    if step <= w or w <= 1:
        return group['amuse_beta1']
    return 1 - ((w - 1) / (step - 1)) ** group['amuse_rho'] * (1 - group['amuse_beta1'])

@C.zero_guard('momentum')
@C.no_state
def muon_ema(group, update, grad, param, momentum):
    return utils.nesterov_ema(momentum, update, group['muon_mu'])

@C.copy_guard(2, 'z')
@C.no_state
def amuse_sf(group, update, grad, param, z):
    group['beta'] = _amuse_beta(group)
    return _raw_sf(group, update, grad, param, z)

class AMUSE(C.ScheduleFree):
    def __init__(self, params, **kw):
        d = dict(lr=.02, beta=.6, muon_mu=.95, weight_decay=0, warmup_steps=0,
                 weight_lr_power=2., r=0., amuse_beta1=.6, amuse_rho=.8)
        d.update(kw)
        super().__init__(params, d, fns=(muon_ema, C.orthogonalize_update, amuse_sf))
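To get a feel for the beta schedule in `_amuse_beta`, the sketch below restates the same formula as a standalone function with no optimizer dependency. The name `amuse_beta` and the plain-argument signature are illustrative assumptions, not part of the posted code; the defaults mirror `amuse_beta1=.6` and `amuse_rho=.8` from the post.

```python
def amuse_beta(step, warmup_steps, beta1=0.6, rho=0.8):
    """Standalone restatement of the posted `_amuse_beta` formula.

    Hypothetical helper for illustration: it takes plain numbers
    instead of an optimizer param group.
    """
    step = max(int(step), 1)
    w = warmup_steps
    # During warmup, or when warmup is effectively disabled (w <= 1),
    # beta stays fixed at beta1.
    if step <= w or w <= 1:
        return beta1
    # After warmup, beta rises from beta1 toward 1 as step grows.
    return 1 - ((w - 1) / (step - 1)) ** rho * (1 - beta1)


if __name__ == "__main__":
    for s in (1, 100, 101, 1_000, 100_000):
        print(s, round(amuse_beta(s, warmup_steps=100), 4))
```

Read this way, the posted defaults (`warmup_steps=0`) keep beta constant at `amuse_beta1`, so the schedule only takes effect when `warmup_steps` is set above 1.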


Guilherme O'Tina @guilhermeotina

@cloneofsimo looking at the 50k step chart, most of these converge to basically the same loss by step ~40k. the real difference is the first 10-15k steps where they separate. so optimizer choice seems more about training speed to a given loss than final loss itself
