it's really, really hard to improve much on DeepSeek. No matter how much compute you have, odds are that your grid searches won't find a much better global optimum. By V1, they were locked in. I remember @andrew_n_carr getting a lot of use out of their hparams after V2.
Everyone trains neural nets with learning-rate warmup + decay. But is that shape actually optimal — or just habit?
We searched huge families of schedules and found warmup & decay emerge on their own, even when we don't build them in.
New paper, now in TMLR 🧵👇 (1/7)
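For anyone who hasn't seen the "warmup + decay" shape written out, here is a minimal sketch of the conventional hand-built version: linear warmup followed by cosine decay. This is only the common baseline the thread is questioning, not the schedule family or search procedure from the paper. The function name `warmup_cosine_lr` and the defaults (`peak_lr`, `warmup_steps`, `min_lr`) are hypothetical choices for illustration.

```python
import math


def warmup_cosine_lr(step, total_steps, peak_lr=3e-4, warmup_steps=100, min_lr=0.0):
    """Learning rate at `step` for linear warmup followed by cosine decay.

    All names and default values here are illustrative assumptions,
    not taken from the paper.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly, reaching peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Decay: progress runs from 0 (end of warmup) to 1 (end of training).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print a few points so the rise-then-fall shape is visible.
    for s in (0, 50, 99, 100, 500, 900, 1000):
        print(s, warmup_cosine_lr(s, total_steps=1000))
```

The rate climbs during the first `warmup_steps` steps, peaks, then falls smoothly to `min_lr` by `total_steps`. Swapping the cosine for a linear or exponential decay changes the tail but keeps the same rise-then-fall outline.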
