AI · 11h ago

Technical document shows MAI used a global batch size of nearly 1 billion tokens during final RL training

Engineer Will Brown's back-of-the-envelope estimate suggests that this training scale is plausible.

Original post
Grad @Grad62304977 · #987 in AI

Interestingly didn’t see anyone talking abt this but MAI used a batch size of almost 1B tokens during their final RL stage

4:38 AM · Jun 4, 2026 · 11.7K Views
Posts from X
Views 2.4K · Bookmarks 9 · Likes 26 · Retweets 1 · Replies 3
will brown @willccbb

@Grad62304977 seems about right?

if rollouts are 100k tokens then 10k rollouts per batch seems reasonable for variance reduction + balances well w frontier scale infra / run times
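
The arithmetic behind that reply is a single multiplication. A minimal sketch, using only the two round numbers quoted in the post (both are guesses by the poster, not measured values):

```python
# Figures quoted in the reply above; both are round-number guesses.
tokens_per_rollout = 100_000   # "rollouts are 100k tokens"
rollouts_per_batch = 10_000    # "10k rollouts per batch"

# Global batch size in tokens is simply the product of the two.
tokens_per_batch = tokens_per_rollout * rollouts_per_batch
print(f"{tokens_per_batch:.1e} tokens per batch")
```

The product is 1 billion tokens, which matches the "almost 1B" batch size mentioned in the original post.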

will brown @willccbb

napkin-math fermi estimates for large-scale RL runs:

assume comms are free, everything fully async, and that O(100)-step staleness is fine, but coupled with research iteration speed. you want p99 rollouts to finish fast enough that overnight ablations show signal.

then for prod runs, you have one clean knob: trainer GPU allocation vs wall-clock time. theoretical limit is 3:1 if inference is FLOP-bound, but more realistically it's 1:3 or 1:4 as a practical sweet spot.
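
One plausible reading of the 3:1 figure, offered as an assumption and not something the post spells out: under the common approximation that a forward pass costs about 2 FLOPs per active parameter per token and a forward plus backward pass costs about 6, training a token takes roughly 3x the FLOPs of sampling it. If both sides ran at the same utilization, trainer GPUs would outnumber sampler GPUs 3:1. The sketch below shows how that ratio flips once sampling runs at much lower utilization. The function name and the MFU values are hypothetical, chosen only for illustration.

```python
def trainer_to_sampler_gpu_ratio(trainer_mfu: float, sampler_mfu: float) -> float:
    """GPU ratio (trainer : sampler) if every sampled token is trained on once.

    Assumes ~6*N FLOPs/token for training and ~2*N FLOPs/token for sampling,
    where N is the active parameter count. This ignores attention FLOPs,
    which are not negligible at 100K-token contexts.
    """
    train_flops_per_param = 6.0   # forward + backward
    sample_flops_per_param = 2.0  # forward only
    flop_ratio = train_flops_per_param / sample_flops_per_param  # 3.0
    # GPUs needed scale as FLOPs divided by achieved utilization.
    return flop_ratio * sampler_mfu / trainer_mfu


# Equal utilization: the FLOP-bound limit.
flop_bound = trainer_to_sampler_gpu_ratio(trainer_mfu=0.30, sampler_mfu=0.30)

# Hypothetical: memory-bound decoding at 1/12 of the trainer's utilization.
memory_bound = trainer_to_sampler_gpu_ratio(trainer_mfu=0.36, sampler_mfu=0.03)

print(f"FLOP-bound:   {flop_bound:.2f} trainer GPUs per sampler GPU")
print(f"memory-bound: {memory_bound:.2f} trainer GPUs per sampler GPU")
```

With these made-up utilization numbers the second case comes out near 0.25, i.e. about 1:4, which is in the range the post calls a practical sweet spot.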

you might also assume that A5B ablations should show clear signal in a few hours for the env distribution, prod run is A50B MoE and p99 rollouts are a few hours, 100K GPUs, 100K tokens/rollout, everything FP4 w 30% MFU

what this very roughly gets you is that a 4-week prod run might look something like 100K steps, each with 100K rollouts, optimistically. in practice there's prob 0.5-1 OOMs of overhead which creep in somewhere, pulling down both your batch size + step count a bit.

i think this feels directionally correct? any other fun heuristics people like using here?

4h · Views 2.4K · Likes 26 · Bookmarks 9
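
The compute budget in the thread above can be sanity-checked in a few lines. This is a rough sketch, not the poster's own calculation. The GPU count, MFU, active parameter count, rollout length, step count and run length come from the post. The per-GPU peak FP4 throughput is an assumption added here (the post does not give one), as is the 2N/6N FLOPs-per-token rule, which ignores attention cost.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600

# Numbers taken from the post.
num_gpus = 100_000
mfu = 0.30                    # "everything FP4 w 30% MFU"
active_params = 50e9          # "A50B MoE", read as 50B active parameters
tokens_per_rollout = 100_000
rollouts_per_step = 100_000
steps = 100_000
run_weeks = 4

# ASSUMPTION (not in the post): dense FP4 peak per GPU, in FLOP/s.
peak_fp4_flops_per_gpu = 1e16

# ASSUMPTION: ~2*N FLOPs/token to sample, ~6*N FLOPs/token to train.
flops_per_token = (2 + 6) * active_params

total_tokens = steps * rollouts_per_step * tokens_per_rollout
required_flops = total_tokens * flops_per_token
available_flops = (
    num_gpus * peak_fp4_flops_per_gpu * mfu * run_weeks * SECONDS_PER_WEEK
)
budget_fraction = required_flops / available_flops

print(f"total tokens:    {total_tokens:.1e}")
print(f"required FLOPs:  {required_flops:.1e}")
print(f"available FLOPs: {available_flops:.1e}")
print(f"fraction of budget used: {budget_fraction:.2f}")
```

Under these assumptions the run needs a little over half of the available FLOPs, so the "100K steps, each with 100K rollouts" figure fits only in the optimistic case the post describes. The estimate is sensitive to the assumed peak throughput and to sampling actually reaching 30% MFU, which is consistent with the post's caveat about 0.5-1 orders of magnitude of overhead.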