The recent Microsoft AI report noted that too much learning rate decay during pretraining hurts post-RL performance. This is actually just the latest of several papers this year pointing out that small learning rates can be harmful in LLM pretraining. (Thread)
Background: in old-school vision tasks like ImageNet/CIFAR, it was widely observed that larger learning rates often yielded better generalization performance than smaller ones. But, for a long time, it was unclear whether an analogous point held for LLMs. https://arxiv.org/abs/1711.04623
It seems from a number of recent works that the answer is yes:
* Watts, Li et al (https://arxiv.org/abs/2605.02105) found that pretraining with a larger peak LR, or a shorter LR decay period, causes the model to better retain its pretraining performance after fine-tuning.
* Catalan-Tatjer et al (https://arxiv.org/abs/2510.06213) found that pretraining with a larger peak LR resulted in better performance after quantization, and that post-quantization performance degrades during LR decay.
* Yano et al (https://arxiv.org/abs/2603.16127) found that learning rate decay during pretraining hurts performance on evals after SFT.
Oh also, in the *multi-epoch* LLM setting, there is evidence that larger LRs yield better population pretraining loss, exactly mirroring what was known in 'classical' image settings (https://arxiv.org/abs/2306.08590). (That experiment varies batch size rather than LR, but LR should behave the same way.)
Why does the learning rate used during training affect the resulting model? The answer seems to revolve around loss function curvature: large-LR optimizer dynamics implicitly regularize the curvature of the loss, preventing the optimizer from moving into high-curvature regions.
For example, in a figure from https://arxiv.org/abs/2103.00065, you can see that cutting the learning rate during gradient descent training causes the curvature (top Hessian eigenvalue) to immediately rise.
For a way to quantify this effect precisely, see: https://arxiv.org/abs/2410.24206
That example was for an MLP, but the same thing happens in LLM training: learning rate decay causes curvature to grow.
(figures from: https://arxiv.org/abs/2604.13627, https://arxiv.org/abs/2510.06213, https://arxiv.org/abs/2603.16127)
Now, why do lower-curvature models exhibit all those favorable properties? IMO, we still lack a completely satisfactory answer to that question. To say "flatter minima = better" is too simplistic; for example, when fine-tuning, people usually report that higher LRs are *bad*.
Nevertheless, hyperparameters matter, and I'm glad we're starting to see good science about LR schedules for LLMs.
PS: as noted by Catalan-Tatjer et al, weight-averaging recovers many benefits of LR decay but without increasing sharpness. More people should consider weight-averaging!
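To make the PS about weight-averaging a bit more concrete, here is a toy sketch, not the setup from any of the papers above. It runs constant-LR SGD on a noisy quadratic and keeps a uniform running mean of the late iterates. The function names, the three "curvatures", and the noise level are all made up for illustration. It only shows the noise-reduction side of the story (the part LR decay is usually used for); a quadratic has fixed curvature, so it says nothing about sharpness.

```python
import numpy as np

def quadratic_loss(w, curvatures):
    """Loss 0.5 * sum_i curvatures[i] * w[i]**2, minimized at w = 0."""
    return 0.5 * float(np.sum(curvatures * w ** 2))

def constant_lr_sgd_with_averaging(lr=0.1, steps=2000, avg_start=1000, noise=1.0, seed=0):
    """Constant-LR SGD on a noisy quadratic, tracking a running mean of the late iterates.

    Returns (mean loss of the individual late iterates, loss of the averaged weights).
    """
    rng = np.random.default_rng(seed)
    curvatures = np.array([1.0, 0.5, 0.1])  # hypothetical toy Hessian eigenvalues
    w = np.ones(3)
    w_avg = np.zeros(3)
    n_avg = 0
    iterate_losses = []
    for t in range(steps):
        # Gradient of the quadratic plus Gaussian noise standing in for minibatch noise.
        grad = curvatures * w + noise * rng.standard_normal(3)
        w = w - lr * grad  # the LR is never decayed
        if t >= avg_start:
            n_avg += 1
            w_avg += (w - w_avg) / n_avg  # incremental uniform average of the weights
            iterate_losses.append(quadratic_loss(w, curvatures))
    return float(np.mean(iterate_losses)), quadratic_loss(w_avg, curvatures)

iterate_loss, averaged_loss = constant_lr_sgd_with_averaging()
print(f"mean loss of late iterates: {iterate_loss:.4f}")
print(f"loss of averaged weights:   {averaged_loss:.4f}")
```

Because the loss here is convex, the averaged weights can never do worse than the average loss of the iterates they were built from. In a real LLM run the picture is less clean, which is why the empirical results in the papers matter.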
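Going back to the curvature story: one piece of intuition for why the LR caps the curvature can be sketched on a one-dimensional quadratic. Gradient descent on L(w) = 0.5 * curvature * w**2 multiplies w by (1 - lr * curvature) each step, so it only contracts when curvature < 2 / lr. I'm assuming this textbook stability threshold is the relevant intuition here; real networks are not quadratics, and the helper names below are hypothetical.

```python
def final_distance(curvature, lr, steps=200, w0=1.0):
    """Run gradient descent on L(w) = 0.5 * curvature * w**2 and return |w| at the end."""
    w = w0
    for _ in range(steps):
        w = w - lr * curvature * w  # gradient of L is curvature * w
    return abs(w)

def max_stable_curvature(lr):
    """Curvature ceiling for the update above: it contracts iff |1 - lr * curvature| < 1."""
    return 2.0 / lr

for lr in (0.1, 0.05):
    ceiling = max_stable_curvature(lr)
    below = final_distance(0.9 * ceiling, lr)  # slightly flatter than the ceiling
    above = final_distance(1.1 * ceiling, lr)  # slightly sharper than the ceiling
    print(f"lr={lr}: ceiling={ceiling:.0f}, |w| below={below:.2e}, |w| above={above:.2e}")
```

Halving the LR doubles the ceiling, so a curvature that the larger LR could not tolerate becomes reachable. That is at least consistent with the observation in the thread that cutting the LR lets curvature rise.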