https://arxiv.org/abs/2605.19095
Schedule-free learning at larger scales!
Experiments reach 2.01 loss on models up to 500 million parameters.
Users are excited about ScheduleFree+ and related schedule-free methods for training large language models because they speed up experiments and incorporate effective techniques like inverse gradient norm scaling.
🚨 New Paper 🚨 ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models
A few modifications to Schedule-Free Learning make it completely LR tuning free, and allow it to greatly outperform schedules for long duration training! https://arxiv.org/abs/2605.19095v1
With additional warmup, the reintroduction of AdamW momentum, and a modified Polyak step size rule, Schedule-Free Learning outperforms classical cosine and linear decay schedules at longer tokens-per-parameter (TPP) budgets. Short TPP budgets (20-100) show no benefit.
Reference Implementation here: https://github.com/facebookresearch/schedule_free/blob/main/schedulefree/adamc_schedulefree_plus_paper.py
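For readers unfamiliar with the Polyak step size mentioned above, here is a minimal sketch of the classical rule, eta = (f(x) - f*) / ||g||^2. It is not the modified rule from the paper or the reference implementation; the function names, the toy quadratic objective, and the assumed lower bound f* = 0 are all illustrative choices.

```python
def polyak_step_size(loss, grad, loss_lower_bound=0.0):
    """Classical Polyak step size: (f(x) - f*) / ||g||^2.

    `loss_lower_bound` plays the role of f*; 0.0 is an assumption that
    happens to be exact for the toy objective below.
    """
    grad_sq_norm = sum(g * g for g in grad)
    if grad_sq_norm == 0.0:
        return 0.0  # at a stationary point there is nothing to step along
    return max(loss - loss_lower_bound, 0.0) / grad_sq_norm


def minimize_quadratic(x0, steps=50):
    """Gradient descent on f(x) = 0.5 * ||x||^2 with Polyak steps."""
    x = list(x0)
    for _ in range(steps):
        loss = 0.5 * sum(v * v for v in x)
        grad = list(x)  # the gradient of 0.5 * ||x||^2 is x itself
        eta = polyak_step_size(loss, grad)
        x = [v - eta * g for v, g in zip(x, grad)]
    return x


if __name__ == "__main__":
    print(minimize_quadratic([3.0, -4.0]))
```

The step size is computed from the current loss and gradient alone, so no learning rate is passed in; that is the property the thread refers to as "no learning rate tuning".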

@rosinality @ryu0000000001 @aaron_defazio is the key modification that we now do AdamW->schedule_free instead of RMSprop->schedule_free?

Yes that’s critical, but the use of inverse gradient norm scaling is also key, and annealing sf-beta over time makes a big difference for the long duration runs. It’s a combination of factors to make it work well. The use of the Polyak step size is exciting. No learning rate tuning!
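As background for the reply above, here is a rough sketch of the base Schedule-Free SGD update from the original Schedule-Free work: the gradient is evaluated at an interpolation y between the base iterate z and the running average x. It deliberately omits everything specific to ScheduleFree+ (AdamW momentum, inverse gradient norm scaling, sf-beta annealing, Polyak steps); the function name, the defaults, and the toy objective are assumptions for illustration.

```python
def schedule_free_sgd(grad_fn, x0, lr=0.5, beta=0.9, steps=200):
    """Base Schedule-Free SGD on a list of floats (illustrative sketch).

    z: base SGD iterate
    x: uniform running average of the z iterates (used for evaluation)
    y: interpolation of z and x where the gradient is evaluated
    """
    z = list(x0)
    x = list(x0)
    for t in range(1, steps + 1):
        y = [(1.0 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
        g = grad_fn(y)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        c = 1.0 / (t + 1)  # averaging weight, so x is the mean of all z so far
        x = [(1.0 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x


if __name__ == "__main__":
    # Toy objective f(v) = 0.5 * ||v||^2, whose gradient is v.
    print(schedule_free_sgd(lambda v: list(v), [1.0, -2.0]))
```

Note that the learning rate stays constant throughout; the averaging of z into x takes the place of a decay schedule.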

@Clashluke @rosinality @aaron_defazio The code is here BTW. The doc string seems informative https://github.com/facebookresearch/schedule_free/blob/main/schedulefree/adamc_schedulefree_plus_paper.py

@rosinality Code : https://github.com/facebookresearch/schedule_free/blob/main/schedulefree/adamc_schedulefree_plus_paper.py

@aaron_defazio I haven't read the specific paper yet - does this method work for tasks that aren't large language models?

@aaron_defazio Exciting work! I'm a big fan of ScheduleFree and Prodigy; they really help with experiment iteration. What do you think of this implementation: https://github.com/LoganBooker/prodigy-plus-schedule-free ?