
Paper presents empirical method for critical batch size


An arXiv paper by William Merrill, Shane Arora, Dirk Groeneveld, and Hannaneh Hajishirzi introduces an empirical method for measuring the critical batch size during transformer language model pretraining. The critical batch size is the largest batch size that can be used before further increases yield diminishing returns in token efficiency; it is commonly estimated from the gradient noise scale. In the experiments, the learning rate is adjusted whenever the batch size changes, and batch size warmup is examined as a way to preserve token efficiency and model accuracy.
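As a rough illustration of what "adjusting the learning rate when the batch size changes" can look like, here is a minimal sketch. The summary does not say which scaling rule the paper uses, so both rules below are assumptions: square-root scaling (a heuristic often used with Adam-style optimizers) and linear scaling (a heuristic often used with SGD). The function name `rescale_lr` is hypothetical.

```python
import math


def rescale_lr(lr: float, old_bs: int, new_bs: int, rule: str = "sqrt") -> float:
    """Return a learning rate adjusted for a batch size change.

    Both rules are common heuristics, not the paper's confirmed method:
      "sqrt":   lr scales with sqrt(new_bs / old_bs)
      "linear": lr scales with new_bs / old_bs
    """
    if old_bs <= 0 or new_bs <= 0:
        raise ValueError("batch sizes must be positive")
    ratio = new_bs / old_bs
    if rule == "sqrt":
        return lr * math.sqrt(ratio)
    if rule == "linear":
        return lr * ratio
    raise ValueError(f"unknown rule: {rule}")


# Example: quadrupling the batch size from 256 to 1024 sequences.
print(rescale_lr(1e-3, 256, 1024, rule="sqrt"))
print(rescale_lr(1e-3, 256, 1024, rule="linear"))
```

Which rule fits best depends on the optimizer and training regime, so the choice is treated here as a parameter, not a fixed fact.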

Original post

@giffmana @Laz4rz nice one on this topic imo https://arxiv.org/abs/2505.23971

8:31 AM · May 16, 2026

Lucas Beyer (bl16) @giffmana

@eliebakouch @Laz4rz I really dislike the warmup, but yeah at least they change lr when changing bs in the experiment. Although... changing bs on the fly like that might be related to putting a schedule on the lr and they don't spend an experiment on that

3:44 PM · May 16, 2026 · 222 Views

@eliebakouch @Laz4rz That's what i mean, changing batch changes lr implicitly.

elie @eliebakouch

@giffmana @Laz4rz oh why? deepseek and a bunch of other big llms do batch size warmup too. on the lr schedule, imo it's basically a fixed lr, so the lr change from batch scaling dominates the lr change from decay

4:01 PM · May 16, 2026 · 270 Views
5:01 PM · May 16, 2026 · 83 Views
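The point raised in the thread, that changing the batch size changes the learning rate implicitly, can be sketched numerically. This is a toy illustration, not the paper's setup: the milestone schedule `BS_WARMUP` and its numbers are made up, and the square-root scaling rule is an assumption. Under that rule, a fixed learning rate at batch size `bs` behaves like `lr * sqrt(ref_bs / bs)` at a reference batch size `ref_bs`, so a batch size warmup with constant learning rate resembles a stepwise learning rate decay.

```python
import math

# Hypothetical batch size warmup: (tokens seen, batch size in sequences).
# Illustrative numbers only, not taken from the paper or the thread.
BS_WARMUP = [(0, 256), (10_000_000_000, 512), (50_000_000_000, 1024)]


def batch_size_at(tokens_seen: int, schedule=BS_WARMUP) -> int:
    """Return the batch size of the last milestone that has been reached."""
    bs = schedule[0][1]
    for start, size in schedule:
        if tokens_seen >= start:
            bs = size
    return bs


def implied_lr_factor(bs: int, ref_bs: int) -> float:
    """Multiplier on a fixed lr when viewed at the reference batch size.

    Assumes sqrt scaling: runs with equal lr / sqrt(batch size) are
    treated as equivalent.
    """
    return math.sqrt(ref_bs / bs)


for tokens in (0, 10_000_000_000, 50_000_000_000):
    bs = batch_size_at(tokens)
    print(tokens, bs, round(implied_lr_factor(bs, ref_bs=256), 3))
```

With these toy numbers, each doubling of the batch size at constant learning rate shrinks the implied factor by `1 / sqrt(2)`, which is the kind of implicit schedule the replies are debating.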
