/Tech · 21d ago

Aaron Defazio submits ScheduleFree+ paper to arXiv extending schedule-free optimization to large language models with lower final loss than linear decay and WSD baselines

Experiments reach 2.01 loss on models up to 500 million parameters.

#464

Original post

Konstantin Mishchenko#1792

Rosinality@rosinality

https://arxiv.org/abs/2605.19095

Schedule-free learning in larger scales!

12:53 AM · May 20, 2026 · 6.2K Views

Sentiment

Commenters are enthusiastic about ScheduleFree+ and related schedule-free methods for training large language models, saying they speed up experiment iteration and incorporate effective techniques such as inverse gradient norm scaling.

Pos

100.0%

Neg

0.0%

2 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 79.4K · Bookmarks 298 · Likes 404 · Retweets 53 · Replies 7

Aaron Defazio@aaron_defazio

🚨 New Paper 🚨 ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models

A few modifications to Schedule-Free Learning make it completely LR tuning free, and allow it to greatly outperform schedules for long duration training! https://arxiv.org/abs/2605.19095v1

21d · 79.4K views · 404 likes · 298 bookmarks

Aaron Defazio@aaron_defazio

With the use of additional warmup, the reintroduction of AdamW momentum, and a modified Polyak step size rule, Schedule-Free Learning outperforms classical cosine and linear decay schedules at longer TPP budgets. Short TPP budgets (20-100) don't show any benefit.

Reference Implementation here: https://github.com/facebookresearch/schedule_free/blob/main/schedulefree/adamc_schedulefree_plus_paper.py

21d1.1K144
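
For readers new to the base method, here is a minimal sketch of the original Schedule-Free SGD update that ScheduleFree+ builds on. It is not the ScheduleFree+ algorithm from the paper: it omits the extra warmup, AdamW momentum, and Polyak step size mentioned in the post. The function name `schedule_free_sgd`, the toy quadratic objective, and the values of `lr` and `beta` are assumptions chosen only for illustration.

```python
def schedule_free_sgd(grad, x0, lr=1.0, beta=0.9, steps=300):
    """Sketch of basic Schedule-Free SGD with a constant learning rate.

    grad: function mapping a list of floats to its gradient (list of floats).
    Returns the averaged iterate x, which is the point used for evaluation.
    """
    z = list(x0)  # base sequence, updated by plain gradient steps
    x = list(x0)  # running average of z, used for evaluation
    for t in range(1, steps + 1):
        # The gradient is evaluated at an interpolation between z and x.
        y = [(1 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
        g = grad(y)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        # Uniform averaging weight; no decay schedule is involved.
        c = 1.0 / (t + 1)
        x = [(1 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x


# Toy objective (hypothetical): f(w) = 0.5 * ||w - target||^2
target = [1.0, 2.0]


def quad_grad(w):
    return [wi - ti for wi, ti in zip(w, target)]


result = schedule_free_sgd(quad_grad, [5.0, -3.0])
```

The point of the sketch is that the learning rate stays constant throughout; the averaging of `z` into `x` takes the place of a decay schedule.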

Lucas Nestler@Clashluke

@rosinality @ryu0000000001 @aaron_defazio is the key modification that we now do AdamW->schedule_free instead of RMSprop->schedule_free?

21d29621

Aaron Defazio@aaron_defazio

Yes that’s critical, but the use of inverse gradient norm scaling is also key, and annealing sf-beta over time makes a big difference for the long duration runs. It’s a combination of factors to make it work well. The use of the Polyak step size is exciting. No learning rate tuning!

21d4431
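
As background for the Polyak step size mentioned above, the following is a rough sketch of the classical Polyak rule, which sets the step to `(f(x) - f_star) / ||g||^2` instead of using a tuned learning rate. This is the textbook rule, not the modified rule used in the paper, and it assumes the optimal value `f_star` is known. The names `polyak_gd`, `f_star`, and the toy quadratic are hypothetical and chosen only for illustration.

```python
def polyak_gd(f, grad, x0, f_star=0.0, steps=30):
    """Gradient descent with the classical Polyak step size.

    The step size is computed from the current loss gap and gradient norm,
    so no learning rate is supplied by the caller.
    """
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq == 0.0:
            break  # zero gradient: already at a stationary point
        step = (f(x) - f_star) / g_norm_sq
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


# Toy objective (hypothetical): f(w) = 0.5 * ||w - target||^2, minimum value 0
target = [1.0, 2.0]


def quad(w):
    return 0.5 * sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def quad_grad(w):
    return [wi - ti for wi, ti in zip(w, target)]


solution = polyak_gd(quad, quad_grad, [5.0, -3.0])
```

On this quadratic the rule yields a step of 0.5 at every iteration, so the distance to the minimum halves each step. In practice `f_star` is usually unknown, which is one reason modified variants of the rule exist.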

ryu@ryu0000000001

@Clashluke @rosinality @aaron_defazio The code is here BTW. The doc string seems informative https://github.com/facebookresearch/schedule_free/blob/main/schedulefree/adamc_schedulefree_plus_paper.py

21d574

ryu@ryu0000000001

@rosinality Code : https://github.com/facebookresearch/schedule_free/blob/main/schedulefree/adamc_schedulefree_plus_paper.py

21d1361

Jade@JadeYi25

@aaron_defazio I haven't read the specific paper yet - does this method work for tasks that aren't large language models?

21d471

sileod@dmnsl1

@aaron_defazio Exciting work, I'm a big fan of ScheduleFree and Prodigy, they really help with experiments iteration, what do you think of this implementation https://github.com/LoganBooker/prodigy-plus-schedule-free ?

21d135