Paper presents empirical method for measuring critical batch size
An arXiv paper by William Merrill, Shane Arora, Dirk Groeneveld, and Hannaneh Hajishirzi introduces an empirical method for measuring the critical batch size during transformer language model pretraining: the largest batch size beyond which gradient noise makes further increases yield diminishing returns. The experiments re-tune the learning rate for each batch size and examine batch size warmup, aiming to preserve token efficiency and model accuracy.
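To make the measurement concrete, here is a minimal sketch of the kind of procedure the summary describes: sweep a ladder of batch sizes, re-tune the learning rate at each one, and call the batch size critical once doubling it stops roughly halving the steps needed to reach a target loss. The toy step-count model, the sqrt learning-rate rule, and the 1.5x speedup threshold below are all illustrative assumptions, not the paper's actual code or prescription.

```python
# Hedged sketch, under stated assumptions: `simulated_steps_to_target` is a
# toy stand-in for a full pretraining run, shaped so that doubling the batch
# halves the step count until the batch approaches an assumed noise scale.

def simulated_steps_to_target(batch_size: int, lr: float) -> int:
    """Toy step-count curve with an assumed 'true' noise scale of 2048."""
    b_noise, s_min = 2048, 10_000
    return int(s_min * (1 + b_noise / batch_size))

def critical_batch_size(batch_sizes: list[int], base_lr: float = 3e-4) -> int:
    prev_bs, prev_steps = None, None
    for bs in sorted(batch_sizes):
        # Re-tune lr per batch size (sqrt scaling is a common heuristic).
        lr = base_lr * (bs / min(batch_sizes)) ** 0.5
        steps = simulated_steps_to_target(bs, lr)
        if prev_steps is not None and prev_steps / steps < 1.5:
            # Doubling the batch no longer buys ~2x fewer steps:
            # diminishing returns have set in.
            return prev_bs
        prev_bs, prev_steps = bs, steps
    return max(batch_sizes)

print(critical_batch_size([256, 512, 1024, 2048, 4096, 8192]))  # -> 2048
```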
@eliebakouch @Laz4rz I really dislike the warmup, but yeah at least they change lr when changing bs in the experiment. Although... changing bs on the fly like that might be related to putting a schedule on the lr, and they don't spend an experiment on that
@giffmana @Laz4rz nice one on this topic imo https://arxiv.org/abs/2505.23971
@eliebakouch @Laz4rz That's what I mean, changing batch changes lr implicitly.
@giffmana @Laz4rz oh why? deepseek and a bunch of other big llms do batch size warmup too. on the lr schedule, imo it's basically a fixed lr, so the lr change from batch scaling dominates the lr change from decay
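A quick back-of-the-envelope check of that last claim (all numbers assumed for illustration): under sqrt learning-rate scaling, one batch size doubling multiplies the lr by about 1.41x all at once, while a slow cosine decay moves it only a few percent over the same stretch of training.

```python
import math

# Assumed numbers for illustration only: compare the lr jump from one batch
# size doubling (sqrt scaling) against the lr drift from a cosine decay over
# the ~5k steps between two doublings.
lr_jump_from_batch = math.sqrt(2)  # ~1.41x, applied all at once

def cosine_mult(step: int, total: int = 100_000, floor: float = 0.1) -> float:
    """Cosine decay multiplier from 1.0 down to `floor` over `total` steps."""
    return floor + (1 - floor) * 0.5 * (1 + math.cos(math.pi * step / total))

lr_drift_from_decay = cosine_mult(25_000) / cosine_mult(20_000)  # ~0.95x

print(f"batch scaling: {lr_jump_from_batch:.2f}x, "
      f"decay drift: {lr_drift_from_decay:.2f}x")
```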