Tech · 21d ago

Aaron Defazio submits ScheduleFree+ paper to arXiv extending schedule-free optimization to large language models with lower final loss than linear decay and WSD baselines

Experiments reach 2.01 loss on models up to 500 million parameters.

Sentiment

Users are excited about ScheduleFree+ and related schedule-free methods for training large language models because they speed up experiments and incorporate effective techniques like inverse gradient norm scaling.

Positive: 100.0% · Negative: 0.0% (2 comments with sentiment)
Cluster Engagement
Posts from X
Views 79.4K · Bookmarks 298 · Likes 404 · Retweets 53 · Replies 7
Aaron Defazio (@aaron_defazio)

🚨 New Paper 🚨 ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models

A few modifications to Schedule-Free Learning make it completely LR tuning free, and allow it to greatly outperform schedules for long duration training! https://arxiv.org/abs/2605.19095v1

21d · Views 79.4K · Likes 404 · Bookmarks 298
Aaron Defazio (@aaron_defazio)

With the use of additional warmup, the reintroduction of AdamW momentum, and a modified Polyak step size rule, Schedule-Free Learning outperforms classical cosine and linear decay schedules at longer TPP budgets. Short TPP budgets (20-100) don't show any benefit.

Reference Implementation here: https://github.com/facebookresearch/schedule_free/blob/main/schedulefree/adamc_schedulefree_plus_paper.py

21d · Views 1.1K · Likes 14 · Bookmarks 4
Lucas Nestler (@Clashluke)

@rosinality @ryu0000000001 @aaron_defazio is the key modification that we now do AdamW->schedule_free instead of RMSprop->schedule_free?

21d · Views 296 · Likes 2 · Bookmarks 1
Aaron Defazio (@aaron_defazio)

Yes that’s critical, but the use of inverse gradient norm scaling is also key, and annealing sf-beta over time makes a big difference for the long duration runs. It’s a combination of factors to make it work well. The use of the Polyak step size is exciting. No learning rate tuning!

21d · Views 44 · Likes 3 · Bookmarks 1
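
For readers unfamiliar with the building blocks named in the posts above, here is a minimal sketch of two of them: the base schedule-free SGD recurrence from the earlier Schedule-Free work, and the classical Polyak step size. This is not the ScheduleFree+ method; it leaves out the AdamW momentum, inverse gradient norm scaling, sf-beta annealing, extra warmup, and the modified Polyak rule mentioned in the thread. The function names, the toy quadratic, and the default values (`lr=0.5`, `beta=0.9`) are illustrative assumptions, not taken from the paper or the reference implementation.

```python
# Illustrative sketch only: base schedule-free SGD and the classical Polyak
# step size on a toy quadratic. All names and defaults are hypothetical.

def polyak_step_size(loss, grad, f_star=0.0, eps=1e-12):
    """Classical Polyak step size: (f(w) - f*) / ||grad||^2.

    f_star is the (assumed known) optimal loss value; eps avoids division
    by zero when the gradient vanishes.
    """
    sq_norm = sum(g * g for g in grad)
    return (loss - f_star) / (sq_norm + eps)


def schedule_free_sgd(grad_fn, w0, lr=0.5, beta=0.9, steps=1000):
    """Base schedule-free SGD with a constant learning rate.

    z is the plain SGD sequence, x is the uniform running average of the
    z iterates (the point used for evaluation), and y interpolates between
    them and is where gradients are computed.
    """
    z = list(w0)
    x = list(w0)
    for t in range(1, steps + 1):
        # Gradient evaluation point: interpolation between z and x.
        y = [(1 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
        g = grad_fn(y)
        # Plain SGD step on the z sequence.
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        # Uniform averaging weight c = 1/t, applied to the updated z.
        c = 1.0 / t
        x = [(1 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x


# Toy problem (assumption): f(w) = 0.5 * ||w - TARGET||^2, so f* = 0.
TARGET = [3.0, -2.0]


def quadratic_loss(w):
    return 0.5 * sum((wi - ti) ** 2 for wi, ti in zip(w, TARGET))


def quadratic_grad(w):
    return [wi - ti for wi, ti in zip(w, TARGET)]


w_final = schedule_free_sgd(quadratic_grad, [0.0, 0.0])
step = polyak_step_size(quadratic_loss([0.0, 0.0]), quadratic_grad([0.0, 0.0]))
```

Note that no decay schedule appears anywhere in the loop: the averaging of `z` into `x` plays the role that learning-rate decay usually plays. The Polyak helper is shown separately because the way ScheduleFree+ combines the two is described in the paper and the reference implementation, not here.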
ryu (@ryu0000000001)

@Clashluke @rosinality @aaron_defazio The code is here BTW. The doc string seems informative https://github.com/facebookresearch/schedule_free/blob/main/schedulefree/adamc_schedulefree_plus_paper.py

21d · Views 57 · Likes 4
ryu (@ryu0000000001)

@rosinality Code : https://github.com/facebookresearch/schedule_free/blob/main/schedulefree/adamc_schedulefree_plus_paper.py

21d · Views 136 · Likes 1
Jade (@JadeYi25)

@aaron_defazio I haven't read the specific paper yet - does this method work for tasks that aren't large language models?

21d · Views 47 · Likes 1
sileod (@dmnsl1)

@aaron_defazio Exciting work, I'm a big fan of ScheduleFree and Prodigy, they really help with experiments iteration, what do you think of this implementation https://github.com/LoganBooker/prodigy-plus-schedule-free ?

21d · Views 135