New preprint: Training for the model you return
Modern LM pipelines often return an averaged model (e.g. an EMA) rather than the final iterate, yet most optimizers are still designed around the final iterate.
How should we change training if we know we will return an EMA? 🧵


