Adam Block argues training procedures must adapt when language models deploy averaged weights like EMA instead of the final iterate

Story Overview

The preprint highlights a core mismatch in language model workflows: optimizers are built to chase the final training checkpoint, yet pipelines often ship an exponential moving average of the weights instead, leaving performance on the table when those two targets diverge.

Original post

Adam Block@adam_block66985

New preprint: Training for the model you return

Modern LM pipelines often return an averaged model (e.g. EMA), rather than the final iterate, but optimizers are still mostly designed around the final iterate.

How should we change training if we know we will return an EMA? 🧵

6:05 AM · Jun 25, 2026 · 6.5K Views

Developer Impact

A Drop-In Fix With Theoretical Backing

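The work, led by Block's master's student Kwok Chun Au, proposes PACE, a "pullback" step that follows each AdamW update and moves the live weights toward the EMA weights with a clipped, per-coordinate gain. The authors describe it as a small modification to a standard training loop. It is derived from a control problem in a noisy quadratic model, comes with a convergence rate for convex losses in a stylized version, and is tested in fine-tuning on three 1–2B LMs and in GPT-2 124M pretraining.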

Open Question

Real-World Reach Still Unmapped

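The thread reports improvements over AdamW and EMA-evaluated AdamW in fine-tuning (SmolLM2-1.7B, Qwen3-1.7B, Gemma3-1B) and in GPT-2 124M pretraining on FineWeb, but it gives no numeric gains and no pipeline adoption details. Whether the approach holds at larger scale, carries over to optimizers such as Muon or SOAP, or influences production training remains an open question.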

Sentiment

The one comment with a sentiment score praises the preprint's clean writeup and derivations.

Positive: 100.0%. Negative: 0.0%. Based on 1 comment with sentiment.

Posts from X

Most Activity

Dylan Foster 🐢@canondetortugas

The legendary Adam Block is now on twitter!!

Replies

Adam Block@adam_block66985

(13/14) Takeaway: If you are going to return an averaged model, the optimizer should know that.

PACE uses the average not only at the end, but as a control signal during training.

Paper: https://arxiv.org/abs/2606.25086

Adam Block@adam_block66985

(3/14) If we know we will return the average, why are we still optimizing as though we will return the final iterate?

PACE is a first attempt to take this question seriously.

Adam Block@adam_block66985

(10/14) These gains persist across learning-rate schedules, learning rates, and EMA powers and the gains are robust across the pullback strengths.

Adam Block@adam_block66985

(11/14) The same pattern appears in pretraining.

On GPT-2 124M trained on FineWeb at the Chinchilla-optimal token budget, PACE improves over AdamW and EMA under constant LR, WSD, and cosine decay schedules.

Adam Block@adam_block66985

(8/14) In the quadratic setting PACE was designed for, the returned average can be strictly better than the uncontrolled average.

On some instances, it can be arbitrarily better.

Adam Block@adam_block66985

(9/14) Empirically, the effect is simple to see in fine-tuning.

Across three 1–2B LMs—SmolLM2-1.7B, Qwen3-1.7B, and Gemma3-1B—PACE improves over AdamW and EMA-evaluated AdamW.

Adam Block@adam_block66985

(6/14) After the AdamW step, PACE pulls the live weights toward the EMA weights with a clipped, per-coordinate gain.

In practice, this is a small modification to a standard training loop.
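A minimal sketch of what such a pullback step could look like, written in plain Python so it runs without dependencies. It is not the authors' implementation: the thread does not give the gain rule, so `gains` and `max_gain` are hypothetical names, the clipping range [0, max_gain] is an assumption, plain SGD on a noisy quadratic stands in for AdamW, and updating the EMA after the pullback is an assumed ordering.

```python
import random


def ema_update(ema, weights, beta):
    """Standard EMA: ema <- beta * ema + (1 - beta) * weights."""
    return [beta * e + (1.0 - beta) * w for e, w in zip(ema, weights)]


def pullback(weights, ema, gains, max_gain):
    """Move each live weight toward its EMA value.

    `gains` and `max_gain` are hypothetical: the thread says only that the
    gain is per-coordinate and clipped, so each gain is clipped to
    [0, max_gain] here.
    """
    pulled = []
    for w, e, g in zip(weights, ema, gains):
        g = min(max(g, 0.0), max_gain)  # clip the per-coordinate gain
        pulled.append(w - g * (w - e))  # step from the live weight toward the EMA
    return pulled


def train(steps=200, lr=0.1, beta=0.9, gain=0.05, max_gain=0.1, seed=0):
    """Toy loop on the loss 0.5 * ||w||^2 with noisy gradients.

    Plain SGD stands in for AdamW to keep the sketch dependency-free.
    """
    rng = random.Random(seed)
    weights = [1.0, -2.0]
    ema = list(weights)
    for _ in range(steps):
        grads = [w + rng.gauss(0.0, 0.1) for w in weights]
        weights = [w - lr * g for w, g in zip(weights, grads)]  # optimizer step
        weights = pullback(weights, ema, [gain] * len(weights), max_gain)
        ema = ema_update(ema, weights, beta)  # assumed ordering: EMA updated last
    return weights, ema


if __name__ == "__main__":
    live, averaged = train()
    print("live weights:", live)
    print("EMA weights: ", averaged)
```

With a gain of zero the loop reduces to ordinary training plus EMA evaluation, which is the baseline the thread compares against.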

Adam Block@adam_block66985

(1/14) Led by my amazing master’s student Kwok Chun Au, we derive a simple “pullback” intervention that improves training empirically and in theory.

Adam Block@adam_block66985

(2/14) We already know EMA works.

Iterate averaging is used throughout deep learning, and many modern LM pipelines return an averaged model.

But optimizer design largely ignores this.

Adam Block@adam_block66985

(12/14) PACE also has a broad basin over pullback strengths.

So while it introduces a new control parameter, performance does not depend on finding a single knife-edge value.

Adam Block@adam_block66985

(5/14) The solution says: pull the live iterate toward an estimate of the optimum and toward consistency with the accumulated average.

This is the conceptual origin of PACE.

Adam Block@adam_block66985

(4/14) Our starting point is an idealized control problem.

In a noisy quadratic model of optimization, we ask for the control that minimizes the error of the returned average, while penalizing how much we intervene.
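One way to write down the kind of problem this post describes, purely as an illustration. All notation below (the curvature matrix $H$, step size $\eta$, EMA coefficient $\beta$, penalty weight $\lambda$, gradient noise $\xi_t$, control $u_t$, horizon $T$) is assumed for this sketch and is not taken from the preprint, whose actual formulation may differ.

```latex
\begin{aligned}
\theta_{t+1} &= \theta_t - \eta\,\bigl(H(\theta_t - \theta^\star) + \xi_t\bigr) + u_t
  && \text{(noisy gradient step on a quadratic, plus control } u_t\text{)} \\
\bar\theta_{t+1} &= \beta\,\bar\theta_t + (1-\beta)\,\theta_{t+1}
  && \text{(the EMA that is returned)} \\
\min_{u_0,\dots,u_{T-1}} \;& \mathbb{E}\Bigl[\,\lVert \bar\theta_T - \theta^\star \rVert^2
  + \lambda \sum_{t=0}^{T-1} \lVert u_t \rVert^2 \Bigr]
  && \text{(error of the average, plus intervention cost)}
\end{aligned}
```

Setting $u_t = 0$ for all $t$ recovers the uncontrolled average, and a larger $\lambda$ makes intervening more expensive.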

Adam Block@adam_block66985

(7/14) PACE is not just a heuristic from a toy model.

For convex losses, a stylized version gets the standard stochastic optimization rate, up to an EMA-dependent factor.

GeekPark@GeekParkHQ

@adam_block66985 Really clean writeup, thanks for walking through the derivation! Quick question here: Pullback toward the EMA is also an implicit per-coordinate LR decay. Is the gain the average-as-control signal, or a step-size effect a tuned schedule would recover?

Adam Block@adam_block66985

(14/14) Lots of questions remain.

Can we derive analogous interventions for other optimizers? How does PACE interact with Muon or SOAP? Can similar "training for the returned model" ideas help stabilize RL-style post-training?
