12h ago

Aaron Defazio submits ScheduleFree+ paper to arXiv extending schedule-free optimization to large language models with lower final loss than linear decay and WSD baselines

Experiments reach 2.01 loss on models up to 500 million parameters.

Original post

https://arxiv.org/abs/2605.19095 Schedule-free learning in larger scales!

12:53 AM · May 20, 2026

🚨 New Paper 🚨 ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models

A few modifications to Schedule-Free Learning make it completely LR tuning free, and allow it to greatly outperform schedules for long duration training! https://arxiv.org/abs/2605.19095v1

6:13 PM · May 20, 2026 · 4.6K Views

With the use of additional warmup, the reintroduction of AdamW momentum, and a modified Polyak step size rule, Schedule-Free Learning outperforms classical cosine and linear decay schedules at longer TPP budgets. Short TPP budgets (20-100) don't show any benefit.
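For readers new to the building blocks named in this post, the following is a minimal sketch of two of them in their earlier, published forms: the base Schedule-Free SGD update and the classical Polyak step size. It is not the ScheduleFree+ algorithm. The additional warmup, the AdamW momentum, and the paper's modified Polyak rule are not reproduced here, and the function names, the toy quadratic, and the hyperparameters are illustrative assumptions.

```python
def schedule_free_sgd(grad, x0, lr=0.5, beta=0.9, steps=1000):
    """Base Schedule-Free SGD on a list of floats (illustrative, not ScheduleFree+)."""
    z = list(x0)  # underlying SGD iterate
    x = list(x0)  # running average of z, the point used for evaluation
    for t in range(1, steps + 1):
        # The gradient is evaluated at an interpolation between z and x.
        y = [(1 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
        g = grad(y)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        c = 1.0 / (t + 1)  # uniform averaging weight, no decay schedule
        x = [(1 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x


def polyak_step_size(f_value, f_star, g):
    """Classical Polyak step size: (f(x) - f*) / ||g||^2."""
    sq_norm = sum(gi * gi for gi in g)
    if sq_norm == 0.0:
        return 0.0  # already at a stationary point
    return (f_value - f_star) / sq_norm


# Hypothetical toy problem: f(w) = 0.5 * ||w - TARGET||^2, minimum value 0.
TARGET = [3.0, -1.0]


def quadratic_grad(w):
    return [wi - ti for wi, ti in zip(w, TARGET)]


result = schedule_free_sgd(quadratic_grad, [0.0, 0.0])
print(result)
```

The averaged iterate `x` plays the role that the end of a decay schedule plays in conventional training, which is why no schedule is needed in this sketch. The Polyak helper needs the optimal value `f_star`, which is known for the toy quadratic but generally not for language-model training; how the paper's modified rule handles this is not stated in the post.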

Reference Implementation here: https://github.com/facebookresearch/schedule_free/blob/main/schedulefree/adamc_schedulefree_plus_paper.py

Aaron Defazio @aaron_defazio

6:13 PM · May 20, 2026 · 501 Views