Aaron Defazio submits ScheduleFree+ paper to arXiv extending schedule-free optimization to large language models with lower final loss than linear decay and WSD baselines · Digg

Posts from X

Aaron Defazio@aaron_defazio

With additional warmup, the reintroduction of AdamW momentum, and a modified Polyak step size rule, Schedule-Free Learning outperforms classical cosine and linear decay schedules at longer tokens-per-parameter (TPP) budgets. Short TPP budgets (20-100) show no benefit.

Reference Implementation here: https://github.com/facebookresearch/schedule_free/blob/main/schedulefree/adamc_schedulefree_plus_paper.py
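
The post mentions a modified Polyak step size rule but does not state it. For orientation only, here is a minimal sketch of the classic, unmodified Polyak step size, which sets each step to (f(w) - f*) / ||grad f(w)||^2 and so needs no hand-tuned learning rate when the optimal loss f* is known. This is not the rule from the ScheduleFree+ paper. The toy quadratic, the assumption f* = 0, and every function name below are hypothetical choices made for illustration.

```python
# Classic Polyak step size on a toy quadratic.
# NOT the modified rule from the ScheduleFree+ paper; all names are hypothetical.

CURVATURE = [1.0, 10.0]  # assumed toy problem: f(w) = 0.5 * sum(a_i * w_i**2)


def loss(w):
    """Toy objective with minimum value 0 at w = 0."""
    return 0.5 * sum(a * wi * wi for a, wi in zip(CURVATURE, w))


def grad(w):
    """Gradient of the toy objective."""
    return [a * wi for a, wi in zip(CURVATURE, w)]


def polyak_gradient_descent(w0, optimal_loss=0.0, steps=50):
    """Gradient descent with step = (f(w) - f*) / ||grad f(w)||^2."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        grad_sq = sum(gi * gi for gi in g)
        gap = loss(w) - optimal_loss
        if grad_sq == 0.0 or gap <= 0.0:
            break  # already at the optimum; avoid dividing by zero
        step = gap / grad_sq  # no learning rate hyperparameter appears here
        w = [wi - step * gi for wi, gi in zip(w, g)]
    return w


w_final = polyak_gradient_descent([1.0, 1.0])
print("initial loss:", loss([1.0, 1.0]), "final loss:", loss(w_final))
```

The classic rule depends on knowing f*, which is rarely available in practice; the linked reference implementation is the place to look for how the paper's variant actually works.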

Aaron Defazio@aaron_defazio

🚨 New Paper 🚨 ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models

A few modifications to Schedule-Free Learning make it completely LR tuning free, and allow it to greatly outperform schedules for long duration training! https://arxiv.org/abs/2605.19095v1

REPLIES2

Lucas Nestler@Clashluke

@rosinality @ryu0000000001 @aaron_defazio Is the key modification that we now do AdamW->schedule_free instead of RMSprop->schedule_free?

Aaron Defazio@aaron_defazio

Yes, that's critical, but inverse gradient norm scaling is also key, and annealing sf-beta over time makes a big difference for the long-duration runs. It's a combination of factors that makes it work well. The use of the Polyak step size is exciting: no learning rate tuning!
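
For context on what sf-beta controls: assuming it refers to the interpolation coefficient beta of the original Schedule-Free method (the reply does not define it), the base Schedule-Free SGD recursion keeps a gradient-step sequence z and a running average x, and evaluates gradients at the mix y = (1 - beta) * z + beta * x. The sketch below shows only that base recursion with a constant beta and a constant learning rate. It omits everything the reply lists as new (AdamW momentum, inverse gradient norm scaling, beta annealing, the Polyak step size), and the toy quadratic and all names are hypothetical.

```python
# Base Schedule-Free SGD recursion on a toy quadratic.
# NOT ScheduleFree+; beta and lr are fixed here, and all names are hypothetical.

CURVATURE = [1.0, 10.0]  # assumed toy problem: f(w) = 0.5 * sum(a_i * w_i**2)


def loss(w):
    """Toy objective with minimum value 0 at w = 0."""
    return 0.5 * sum(a * wi * wi for a, wi in zip(CURVATURE, w))


def grad(w):
    """Gradient of the toy objective."""
    return [a * wi for a, wi in zip(CURVATURE, w)]


def schedule_free_sgd(w0, lr=0.05, beta=0.9, steps=200):
    """Return the averaged iterate x after `steps` Schedule-Free SGD updates."""
    z = list(w0)  # base sequence that takes the gradient steps
    x = list(w0)  # running average of z, used for evaluation
    for t in range(1, steps + 1):
        # Gradients are evaluated at an interpolation between z and x.
        y = [(1.0 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
        g = grad(y)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        # Uniform averaging weight c = 1 / (t + 1); no decay schedule is involved.
        c = 1.0 / (t + 1)
        x = [(1.0 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x


x_final = schedule_free_sgd([1.0, 1.0])
print("initial loss:", loss([1.0, 1.0]), "final loss:", loss(x_final))
```

Setting beta = 0 in this sketch reduces it to plain SGD with iterate averaging, and beta = 1 evaluates gradients at the average itself, which is one way to see why the choice of beta (and, per the reply, annealing it) can matter.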

ryu@ryu0000000001

@Clashluke @rosinality @aaron_defazio The code is here, BTW. The docstring seems informative: https://github.com/facebookresearch/schedule_free/blob/main/schedulefree/adamc_schedulefree_plus_paper.py

ryu@ryu0000000001

@rosinality Code: https://github.com/facebookresearch/schedule_free/blob/main/schedulefree/adamc_schedulefree_plus_paper.py

Jade@JadeYi25

@aaron_defazio I haven't read the specific paper yet - does this method work for tasks that aren't large language models?

sileod@dmnsl1

@aaron_defazio Exciting work! I'm a big fan of ScheduleFree and Prodigy; they really help with experiment iteration. What do you think of this implementation: https://github.com/LoganBooker/prodigy-plus-schedule-free ?
