/AI · 11h ago

Technical document shows MAI used a global batch size of nearly 1 billion tokens during final RL training

Engineer Will Brown estimated that a training run at this scale is feasible.

Original post

Grad @Grad62304977 · #987 in AI

Interestingly didn’t see anyone talking abt this but MAI used a batch size of almost 1B tokens during their final RL stage

4:38 AM · Jun 4, 2026 · 11.7K Views

Sentiment

Positive users praise MAI's large RL batch sizes as effective and scalable, while negative users worry that the async approach feels fragile because it relies on off-policy fixes.

Pos: 50.0%

Neg: 50.0%

2 comments with sentiment.

Posts from X

Most Activity

Views 2.7K · Likes 39 · Retweets 2

Grad @Grad62304977

In general RL seems to have a very different batch size scaling than pretraining.

Also, even in their previous shorter-context stages, the batch sizes were always bigger than or equal to the pretraining batch size

11h ago

Bookmarks 9 · Replies 3

will brown @willccbb

@Grad62304977 seems about right?

if rollouts are 100k tokens then 10k rollouts per batch seems reasonable for variance reduction + balances well w frontier scale infra / run times
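As a quick check on the arithmetic in this reply, the sketch below multiplies the two figures it quotes (100k tokens per rollout, 10k rollouts per batch). The function name is illustrative and not taken from the MAI document, and the calculation assumes every rollout has the same length.

```python
def batch_tokens(tokens_per_rollout: int, rollouts_per_batch: int) -> int:
    """Total tokens in one global batch, assuming uniform rollout length."""
    return tokens_per_rollout * rollouts_per_batch


# Figures from the reply above: 100k-token rollouts, 10k rollouts per batch.
total = batch_tokens(100_000, 10_000)
print(f"{total:,} tokens per batch")
```

Under those two figures the product is 10^9 tokens, which lines up with the "almost 1B" batch size in the original post.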

Views 2.4K · Bookmarks 9 · Likes 26 · Retweets 1 · Replies 3

will brown @willccbb

napkin-math fermi estimates for large-scale RL runs:

assume comms are free, everything fully async, and that O(100)-step staleness is fine, but coupled with research iteration speed. you want p99 rollouts to finish fast enough that overnight ablations show signal.

then for prod runs, you have one clean knob: trainer GPU allocation vs wall-clock time. theoretical limit is 3:1 if inference is FLOP-bound, but more realistically it's 1:3 or 1:4 as a practical sweet spot.
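One plausible reading of the 3:1 figure, sketched below, uses the common approximation that training costs about 6 FLOPs per parameter per token (forward plus backward) while sampling costs about 2 (forward only). That reading, the assumption that each sampled token is trained on exactly once, and the `inference_efficiency` knob are assumptions made for illustration, not something the post states.

```python
TRAIN_FLOPS_PER_PARAM_TOKEN = 6.0  # forward + backward, standard approximation
INFER_FLOPS_PER_PARAM_TOKEN = 2.0  # forward only


def trainer_to_inference_gpu_ratio(inference_efficiency: float,
                                   trainer_efficiency: float = 1.0) -> float:
    """Trainer GPUs needed per inference GPU so neither side waits.

    Efficiencies are relative hardware utilizations (hypothetical knobs).
    Assumes every sampled token is trained on exactly once.
    """
    if inference_efficiency <= 0 or trainer_efficiency <= 0:
        raise ValueError("efficiencies must be positive")
    trainer_gpu_time = TRAIN_FLOPS_PER_PARAM_TOKEN / trainer_efficiency
    inference_gpu_time = INFER_FLOPS_PER_PARAM_TOKEN / inference_efficiency
    return trainer_gpu_time / inference_gpu_time


# Inference as efficient as training (FLOP-bound): 3 trainer GPUs per sampler.
flop_bound = trainer_to_inference_gpu_ratio(1.0)

# Inference at one ninth of trainer utilization: the ratio flips to about 1:3.
practical = trainer_to_inference_gpu_ratio(1.0 / 9.0)
```

In this toy model the ratio only falls to the 1:3 range once inference runs at roughly a ninth of trainer utilization, which would be consistent with decoding being limited by something other than raw FLOPs.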

you might also assume that A5B ablations should show clear signal in a few hours for the env distribution, prod run is A50B MoE and p99 rollouts are a few hours, 100K GPUs, 100K tokens/rollout, everything FP4 w 30% MFU

what this very roughly gets you is that a 4-week prod run might look something like 100K steps, each with 100K rollouts, optimistically. in practice there's prob 0.5-1 OOMs of overhead which creep in somewhere, pulling down both your batch size + step count a bit.

i think this feels directionally correct? any other fun heuristics people like using here?
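The estimate in this post can be roughly reproduced with a few lines of arithmetic. The sketch below takes the post's figures (100K GPUs, 30% MFU, a 4-week run, 50B active parameters, 100K tokens per rollout) and adds two assumptions of its own: a peak FP4 throughput of 1e16 FLOP/s per GPU, which is a hypothetical round number, and a cost of 2 FLOPs per parameter per token to sample plus 6 to train, ignoring attention and communication costs.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def rollout_budget(num_gpus: int, peak_flops_per_gpu: float, mfu: float,
                   weeks: float, active_params: float,
                   tokens_per_rollout: int) -> float:
    """Rough number of rollouts a run can both sample and train on."""
    total_flops = num_gpus * peak_flops_per_gpu * mfu * weeks * SECONDS_PER_WEEK
    # Assumed cost model: 2 FLOPs/param/token to sample + 6 to train.
    flops_per_token = (2 + 6) * active_params
    total_tokens = total_flops / flops_per_token
    return total_tokens / tokens_per_rollout


budget = rollout_budget(
    num_gpus=100_000,
    peak_flops_per_gpu=1e16,  # ASSUMPTION: hypothetical FP4 peak, not from the post
    mfu=0.30,
    weeks=4,
    active_params=50e9,       # "A50B" read as 50B active parameters
    tokens_per_rollout=100_000,
)
claimed = 100_000 * 100_000   # 100K steps x 100K rollouts, from the post

print(f"budget ~ {budget:.2e} rollouts, post's figure = {claimed:.2e}")
```

With these inputs the budget comes out within a factor of two of the post's 100K steps times 100K rollouts, so the estimate looks self-consistent at the order-of-magnitude level. The result scales linearly with the assumed peak throughput, so a different hardware figure shifts it proportionally.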

4h ago