
Microsoft Report Shows Less LR Decay Boosts LLM Post-RL Performance


Original post

Jeremy Cohen@deepcohen

The recent Microsoft AI report noted that too much learning rate decay during pretraining hurts post-RL performance. This is actually just the latest of several papers this year pointing out that small learning rates can be harmful in LLM pretraining. (Thread)

5:37 PM · Jun 7, 2026 · 6.7K Views



Jeremy Cohen@deepcohen

Background: in old-school vision tasks like ImageNet/CIFAR, it was widely observed that larger learning rates often yielded better generalization performance than smaller ones. But, for a long time, it was unclear whether an analogous point held for LLMs. https://arxiv.org/abs/1711.04623


Jeremy Cohen@deepcohen

It seems from a number of recent works that the answer is yes:

* Watts, Li et al (https://arxiv.org/abs/2605.02105) found that pretraining with a larger peak LR, or a shorter LR decay period, causes the model to better retain its pretraining performance after fine-tuning.
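To make the knobs in question concrete, here is a minimal sketch of a warmup-stable-decay style schedule. It is an illustration only: the function name `wsd_lr`, its parameters, and the linear shapes are assumptions, not the schedule used in any of the cited papers. "Larger peak LR" maps to `peak_lr`, "shorter LR decay period" to a smaller `decay_frac`, and "less decay" to a larger `final_lr_ratio`.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_frac, final_lr_ratio=0.0):
    """Hypothetical warmup-stable-decay schedule (illustrative, not from the papers).

    Linear warmup to peak_lr, hold at peak_lr, then linear decay over the
    last decay_frac of training down to final_lr_ratio * peak_lr.
    """
    decay_steps = int(round(decay_frac * total_steps))
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if decay_steps == 0 or step < decay_start:
        # Stable phase (or no decay phase at all).
        return peak_lr
    # Linear decay from peak_lr toward the floor.
    progress = min(1.0, (step - decay_start) / decay_steps)
    floor = final_lr_ratio * peak_lr
    return peak_lr - (peak_lr - floor) * progress
```

For example, `wsd_lr(step, 1000, 3e-4, 100, decay_frac=0.2)` decays to zero over the last 200 steps, while passing `final_lr_ratio=0.1` stops the decay at a tenth of the peak.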


Jeremy Cohen@deepcohen

Nevertheless, hyperparameters matter, and I'm glad we're starting to see good science about LR schedules for LLMs.

PS: as noted by Catalan-Tatjer et al, weight-averaging recovers many benefits of LR decay but without increasing sharpness. More should consider weight-averaging!
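As a rough illustration of what weight-averaging means in practice, the sketch below shows two common variants: a uniform average over saved checkpoints and an exponential moving average updated during training. The thread does not say which variant Catalan-Tatjer et al use, so the function names, the `decay` value, and the checkpoint format (a dict mapping parameter names to lists of floats) are all assumptions made for illustration.

```python
def uniform_average(checkpoints):
    """Average parameters elementwise across a list of checkpoints.

    Each checkpoint is assumed to be a dict: name -> list of floats.
    """
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }


def ema_update(avg, new, decay=0.99):
    """One exponential-moving-average step: avg <- decay * avg + (1 - decay) * new."""
    return {
        name: [decay * a + (1.0 - decay) * n for a, n in zip(avg[name], new[name])]
        for name in avg
    }
```

The averaged weights are used only for evaluation or export; training continues from the unaveraged weights at the current learning rate.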

Jeremy Cohen@deepcohen

Now, why do lower-curvature models exhibit all those favorable properties? IMO, we still lack a completely satisfactory answer to that question. To say "flatter minima = better" is too simplistic; for example, when fine-tuning, people usually report that higher LRs are *bad*.


Jeremy Cohen@deepcohen

That was for an MLP, but the same thing happens in LLM training: learning rate decay causes curvature to grow.

(figures from: https://arxiv.org/abs/2604.13627, https://arxiv.org/abs/2510.06213, https://arxiv.org/abs/2603.16127)

For a way to quantify this effect precisely, see: https://arxiv.org/abs/2410.24206

Jeremy Cohen@deepcohen

For example, in this figure from https://arxiv.org/abs/2103.00065, you can see that cutting the learning rate during gradient descent training causes the curvature (top Hessian eigenvalue) to immediately rise:
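For readers who want to measure the quantity in question, here is a minimal sketch of estimating the top Hessian eigenvalue by power iteration on Hessian-vector products. It is not the code behind the cited figure: the helper names, the finite-difference approximation, and the toy quadratic are assumptions chosen to keep the example dependency-free. Power iteration returns the eigenvalue of largest magnitude, which matches the top eigenvalue only when that one is positive and dominant.

```python
import math


def hvp(grad_fn, w, v, eps=1e-4):
    """Approximate the Hessian-vector product H v by central differences of the gradient."""
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    g_plus, g_minus = grad_fn(w_plus), grad_fn(w_minus)
    return [(gp - gm) / (2 * eps) for gp, gm in zip(g_plus, g_minus)]


def top_hessian_eigenvalue(grad_fn, w, iters=100):
    """Power iteration: repeatedly apply H to a unit vector and renormalize."""
    v = [1.0 / math.sqrt(len(w))] * len(w)  # fixed start vector for reproducibility
    eig = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        eig = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient v^T H v
        norm = math.sqrt(sum(h * h for h in hv))
        if norm == 0.0:
            break
        v = [h / norm for h in hv]
    return eig


# Toy check: f(w) = 0.5 * (4 * w0^2 + 1 * w1^2) has Hessian diag(4, 1).
def toy_grad(w):
    return [4.0 * w[0], 1.0 * w[1]]


sharpness = top_hessian_eigenvalue(toy_grad, [1.0, -2.0])  # approximately 4.0
```

For a real network one would replace the finite-difference `hvp` with an automatic-differentiation Hessian-vector product and use a random start vector.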


Jeremy Cohen@deepcohen

* Catalan-Tatjer et al (https://arxiv.org/abs/2510.06213) found that pretraining with a larger peak LR resulted in better performance after quantization, and that post-quantization performance degrades during LR decay.


Jeremy Cohen@deepcohen

Why does the learning rate used during training affect the resulting model? The answer seems to revolve around loss function curvature: large-LR optimizer dynamics implicitly regularize the curvature of the loss, preventing the optimizer from moving into high-curvature regions.
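One textbook way to see the link between step size and curvature, offered as a toy illustration and not as the thread's full argument: plain gradient descent with learning rate `lr` on a one-dimensional quadratic with curvature `a` multiplies the iterate by `(1 - lr * a)` each step, so it converges only when `a < 2 / lr`. The function name and the specific numbers below are assumptions for the sake of the example.

```python
def gd_on_quadratic(curvature, lr, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2 and return the final x."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x  # gradient of f is curvature * x
    return x


# With lr = 0.1 the stability threshold is 2 / lr = 20.
stable = gd_on_quadratic(curvature=15.0, lr=0.1)          # shrinks toward 0
unstable = gd_on_quadratic(curvature=25.0, lr=0.1)        # blows up
after_lr_cut = gd_on_quadratic(curvature=25.0, lr=0.05)   # threshold is now 40: shrinks again
```

In this toy picture a large learning rate cannot settle in directions sharper than `2 / lr`, and cutting the learning rate raises that ceiling, which is consistent with curvature rising after LR decay.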


Jeremy Cohen@deepcohen

Oh also, in the *multi-epoch* LLM setting, there is evidence that larger LRs yield better population pretraining loss, exactly mirroring what was known in 'classical' image settings (https://arxiv.org/abs/2306.08590). (That experiment varies batch size rather than LR, but the effect of LR should be analogous.)


Jeremy Cohen@deepcohen

* Yano et al (https://arxiv.org/abs/2603.16127) found that learning rate decay during pretraining hurts performance on evals after SFT.
