Tech · 5h ago

Adam Block argues training procedures must adapt when language models deploy averaged weights like EMA instead of the final iterate

Story Overview

The preprint highlights a core mismatch in language model workflows: optimizers are built to chase the final training checkpoint, yet pipelines often ship an exponential moving average of the weights instead, leaving performance on the table when those two targets diverge.
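The mismatch can be made concrete with a minimal sketch, assuming a one-dimensional noisy quadratic and plain gradient descent rather than a real LM optimizer. The names `ema_update` and `run_noisy_descent`, and every default value, are hypothetical choices for illustration. The point is only that the optimizer steps act on the live weights while a separate averaged copy is what gets returned.

```python
import random


def ema_update(avg, weights, decay):
    """One EMA step per coordinate: avg <- decay * avg + (1 - decay) * weights."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]


def run_noisy_descent(steps=2000, lr=0.1, decay=0.99, noise=1.0, seed=0):
    """Noisy gradient descent on f(x) = 0.5 * x^2 (optimum at 0), tracking an EMA.

    Returns (final_iterate, ema_of_iterates). A pipeline that ships the EMA
    returns the second value, although every update was computed from the first.
    """
    rng = random.Random(seed)
    x = [5.0]          # live weights, updated by the optimizer
    avg = list(x)      # averaged weights, the model that would be returned
    for _ in range(steps):
        # Gradient of 0.5 * x^2 is x; add Gaussian noise to mimic minibatching.
        grad = [xi + noise * rng.gauss(0.0, 1.0) for xi in x]
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
        avg = ema_update(avg, x, decay)
    return x, avg


if __name__ == "__main__":
    final_iterate, averaged = run_noisy_descent()
    print("final iterate:", final_iterate)
    print("EMA of iterates:", averaged)
```

Nothing in this loop tells the update rule that `avg` is the quantity being returned, which is the gap the preprint targets.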

Original post

Adam Block@adam_block66985

New preprint: Training for the model you return

Modern LM pipelines often return an averaged model (e.g. EMA), rather than the final iterate, but optimizers are still mostly designed around the final iterate.

How should we change training if we know we will return an EMA? 🧵

6:05 AM · Jun 25, 2026 · 25.9K Views

Developer Impact

A Drop-In Fix With Theoretical Backing

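Block and collaborators, led by master's student Kwok Chun Au, propose PACE, a "pullback" step applied after each AdamW update that pulls the live weights toward the EMA weights with a clipped, per-coordinate gain. The thread presents it as a small modification to a standard training loop, derived from a control problem in a noisy quadratic model and supported by a convergence rate for convex losses, with experiments covering fine-tuning of three 1–2B models and GPT-2 124M pretraining on FineWeb.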

Open Question

Real-World Reach Still Unmapped

The thread reports improvements over AdamW and EMA-evaluated AdamW but cites no specific numbers, benchmark lists, or pipeline adoption details, so whether the approach scales beyond the reported regimes (fine-tuning of 1–2B models and GPT-2 124M pretraining) or influences production training remains an open question.

Sentiment

Commenters call the preprint's EMA-aware optimizer change a clean, promising idea worth investigating further.

Pos 100.0% · Neg 0.0% (4 comments with sentiment)

Posts from X

Most Activity

VIEWS 13.1K · BOOKMARKS 33 · LIKES 51 · RETWEETS 3

Dylan Foster 🐢@canondetortugas

The legendary Adam Block is now on twitter!!

5h · 13.1K views · 51 likes · 33 bookmarks

REPLIES 1

rohan anil@_arohan_

Very cool work with connections to the schedule-free work and iterative averaging work.

Interestingly, a natural step is to derive it for vanilla Shampoo, because it keeps the preconditioner around (b2>0). Essentially, replace the diagonal with the Kronecker factors.

Could be cool to extend the per-coordinate scheme to various other variance schemes.

19m912107

Lucas Beyer (bl16)@giffmana

@adam_block66985 Very nice idea!

32m35312

Adam Block@adam_block66985

@GeekParkHQ Great question! The answer is the former. On a related note, although we did observe that PACE can at times substitute for lr decay (e.g. WSD at different token budgets), our best runs still used some form of decay.

3h1855

Adam Block@adam_block66985

@_arohan_ Yes that is a great idea and we are actively investigating similar approaches!

17m201

Adam Block@adam_block66985

(3/14) If we know we will return the average, why are we still optimizing as though we will return the final iterate?

PACE is a first attempt to take this question seriously.

5h71

Adam Block@adam_block66985

(10/14) These gains persist across learning-rate schedules, learning rates, and EMA powers, and they are robust across pullback strengths.

5h61

Adam Block@adam_block66985

(11/14) The same pattern appears in pretraining.

On GPT-2 124M trained on FineWeb at the Chinchilla-optimal token budget, PACE improves over AdamW and EMA under constant LR, WSD, and cosine decay schedules.

5h61
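For readers unfamiliar with the three schedules named in (11/14), here is a minimal sketch of how they are commonly defined. The function names, the warmup and decay fractions, and the linear shape of the WSD decay phase are assumptions for illustration; the thread does not state the exact settings used in the experiments.

```python
import math


def constant_lr(step, total_steps, peak_lr):
    """Constant schedule: the same learning rate at every step."""
    return peak_lr


def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 down to min_lr at total_steps."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.1, decay_frac=0.2, min_lr=0.0):
    """Warmup-stable-decay: linear warmup, flat plateau, then linear decay.

    The linear decay shape is one common choice, assumed here.
    """
    warmup_steps = int(warmup_frac * total_steps)
    decay_start = total_steps - int(decay_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    frac = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr + (min_lr - peak_lr) * min(frac, 1.0)


if __name__ == "__main__":
    for step in (0, 50, 90, 100):
        print(step, constant_lr(step, 100, 1.0), cosine_lr(step, 100, 1.0), wsd_lr(step, 100, 1.0))
```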

Adam Block@adam_block66985

(8/14) In the quadratic setting PACE was designed for, the returned average can be strictly better than the uncontrolled average.

On some instances, it can be arbitrarily better.

5h51

Adam Block@adam_block66985

(9/14) Empirically, the effect is simple to see in fine-tuning.

Across three 1–2B LMs—SmolLM2-1.7B, Qwen3-1.7B, and Gemma3-1B—PACE improves over AdamW and EMA-evaluated AdamW.

5h51

Adam Block@adam_block66985

(6/14) After the AdamW step, PACE pulls the live weights toward the EMA weights with a clipped, per-coordinate gain.

In practice, this is a small modification to a standard training loop.

5h51
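The step described in (6/14) might look roughly like the sketch below. This is not the authors' implementation: the thread does not say how PACE computes its per-coordinate gain, so `gains` is simply passed in by the caller, and the names `pullback_step` and `max_gain` and the clipping range are hypothetical.

```python
def clip(value, low, high):
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(high, value))


def pullback_step(live, ema, gains, max_gain=0.5):
    """Pull each live coordinate toward its EMA counterpart.

    For every coordinate: live <- live - g * (live - ema), with g clipped
    to [0, max_gain]. g = 0 leaves the weight alone; g = 1 would snap it
    onto the EMA value.
    """
    pulled = []
    for w, a, g in zip(live, ema, gains):
        g = clip(g, 0.0, max_gain)
        pulled.append(w - g * (w - a))
    return pulled


if __name__ == "__main__":
    # Assumed order inside a training loop (also an assumption):
    #   1. optimizer step on the live weights (AdamW in the thread)
    #   2. EMA update from the new live weights
    #   3. pullback of the live weights toward the EMA
    live = [1.0, -2.0]
    ema = [0.0, 0.0]
    gains = [0.25, 5.0]  # the second gain exceeds max_gain and is clipped to 0.5
    print(pullback_step(live, ema, gains))  # [0.75, -1.0]
```

Because the pullback touches only the weights after the optimizer step, a change of this shape leaves the optimizer itself untouched, in line with the thread's description of a small modification to a standard training loop.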

Adam Block@adam_block66985

(1/14) Led by my amazing master’s student Kwok Chun Au, we derive a simple “pullback” intervention that improves training empirically and in theory.

5h17

Adam Block@adam_block66985

(13/14) Takeaway: If you are going to return an averaged model, the optimizer should know that.

PACE uses the average not only at the end, but as a control signal during training.

Paper: https://arxiv.org/abs/2606.25086

5h14

Adam Block@adam_block66985

(2/14) We already know EMA works.

Iterate averaging is used throughout deep learning, and many modern LM pipelines return an averaged model.

But optimizer design largely ignores this.

5h11

Adam Block@adam_block66985

(12/14) PACE also has a broad basin over pullback strengths.

So while it introduces a new control parameter, performance does not depend on finding a single knife-edge value.

5h7

Adam Block@adam_block66985

(5/14) The solution says: pull the live iterate toward an estimate of the optimum and toward consistency with the accumulated average.

This is the conceptual origin of PACE.

5h7

Adam Block@adam_block66985

(4/14) Our starting point is an idealized control problem.

In a noisy quadratic model of optimization, we ask for the control that minimizes the error of the returned average, while penalizing how much we intervene.

5h7
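A problem of this shape could be written as follows. This is an illustrative reading of (4/14), not the preprint's exact statement: the Hessian H, step size η, EMA decay β, penalty weight λ, gradient noise ξ_t, and control u_t are all symbols assumed here.

```latex
\begin{aligned}
\min_{u_0,\dots,u_{T-1}} \quad
  & \mathbb{E}\,\bigl\|\bar{x}_T - x^\star\bigr\|^2
    + \lambda \sum_{t=0}^{T-1} \mathbb{E}\,\|u_t\|^2 \\
\text{s.t.} \quad
  & x_{t+1} = x_t - \eta\,\bigl(H(x_t - x^\star) + \xi_t\bigr) + u_t, \\
  & \bar{x}_{t+1} = \beta\,\bar{x}_t + (1-\beta)\,x_{t+1}.
\end{aligned}
```

The first term is the error of the returned average, the second penalizes intervention, and the constraints are a noisy gradient step on a quadratic plus the EMA recursion.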

Adam Block@adam_block66985

(7/14) PACE is not just a heuristic from a toy model.

For convex losses, a stylized version gets the standard stochastic optimization rate, up to an EMA-dependent factor.

5h6

Max Simchowitz@max_simchowitz

Welcome @AdamB1438 to Twitter! His new paper rethinks how to leverage iterate averaging for much, much faster optimization. Really worth the read.

2h60022

GeekPark@GeekParkHQ

@adam_block66985 Really clean writeup, thanks for walking through the derivation! Quick question here: Pullback toward the EMA is also an implicit per-coordinate LR decay. Is the gain the average-as-control signal, or a step-size effect a tuned schedule would recover?

4h17