Paper presents empirical method for measuring critical batch size
An arXiv paper by William Merrill, Shane Arora, Dirk Groeneveld, and Hannaneh Hajishirzi introduces an empirical method for measuring the critical batch size during transformer language model pretraining: the largest batch size beyond which gradient noise makes further increases yield diminishing returns. The experiments re-tune the learning rate for each batch size and examine batch size warmup, aiming to preserve token efficiency and model accuracy.
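To make the measurement concrete, here is a minimal sketch of the kind of procedure the summary describes: sweep a ladder of batch sizes, re-tune the learning rate at each one, and call the batch size critical once doubling it stops roughly halving the steps needed to reach a target loss. The toy step-count model, the sqrt learning-rate rule, and the 1.5x speedup threshold below are all illustrative assumptions, not the paper's actual code or prescription.

```python
# Hedged sketch, under stated assumptions: `simulated_steps_to_target` is a
# toy stand-in for a full pretraining run, shaped so that doubling the batch
# halves the step count until the batch approaches an assumed noise scale.

def simulated_steps_to_target(batch_size: int, lr: float) -> int:
    """Toy step-count curve with an assumed 'true' noise scale of 2048."""
    b_noise, s_min = 2048, 10_000
    return int(s_min * (1 + b_noise / batch_size))

def critical_batch_size(batch_sizes: list[int], base_lr: float = 3e-4) -> int:
    prev_bs, prev_steps = None, None
    for bs in sorted(batch_sizes):
        # Re-tune lr per batch size (sqrt scaling is a common heuristic).
        lr = base_lr * (bs / min(batch_sizes)) ** 0.5
        steps = simulated_steps_to_target(bs, lr)
        if prev_steps is not None and prev_steps / steps < 1.5:
            # Doubling the batch no longer buys ~2x fewer steps:
            # diminishing returns have set in.
            return prev_bs
        prev_bs, prev_steps = bs, steps
    return max(batch_sizes)

print(critical_batch_size([256, 512, 1024, 2048, 4096, 8192]))  # -> 2048
```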
@eliebakouch @Laz4rz I really dislike the warmup, but yeah at least they change lr when changing bs in the experiment. Although... changing bs on the fly like that might be related to putting a schedule on the lr, and they don't spend an experiment on that
@giffmana @Laz4rz nice one on this topic imo https://arxiv.org/abs/2505.23971
@eliebakouch @Laz4rz That's what I mean, changing batch changes lr implicitly.
@giffmana @Laz4rz oh why? deepseek and a bunch of other big llms do batch size warmup too. on the lr schedule, imo it's basically a fixed lr, so the lr change from batch scaling dominates the lr change from decay
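A quick back-of-the-envelope check of that last claim (all numbers assumed for illustration): under sqrt learning-rate scaling, one batch size doubling multiplies the lr by about 1.41x all at once, while a slow cosine decay moves it only a few percent over the same stretch of training.

```python
import math

# Assumed numbers for illustration only: compare the lr jump from one batch
# size doubling (sqrt scaling) against the lr drift from a cosine decay over
# the ~5k steps between two doublings.
lr_jump_from_batch = math.sqrt(2)  # ~1.41x, applied all at once

def cosine_mult(step: int, total: int = 100_000, floor: float = 0.1) -> float:
    """Cosine decay multiplier from 1.0 down to `floor` over `total` steps."""
    return floor + (1 - floor) * 0.5 * (1 + math.cos(math.pi * step / total))

lr_drift_from_decay = cosine_mult(25_000) / cosine_mult(20_000)  # ~0.95x

print(f"batch scaling: {lr_jump_from_batch:.2f}x, "
      f"decay drift: {lr_drift_from_decay:.2f}x")
```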