Technical document shows MAI used a global batch size of nearly 1 billion tokens during final RL training
Engineer Will Brown calculated the training scale is feasible.
@Grad62304977 seems about right?
if rollouts are 100k tokens then 10k rollouts per batch seems reasonable for variance reduction + balances well w frontier scale infra / run times
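To make the arithmetic in that reply explicit, here is a minimal sketch. The function name is hypothetical, and the inputs are just the round numbers from the reply, not figures taken from the technical document itself.

```python
def tokens_per_batch(tokens_per_rollout: int, rollouts_per_batch: int) -> int:
    """Global batch size in tokens, assuming every rollout uses its full length."""
    return tokens_per_rollout * rollouts_per_batch


# Round numbers from the reply: 100k-token rollouts, 10k rollouts per batch.
batch_tokens = tokens_per_batch(100_000, 10_000)
print(f"{batch_tokens:,} tokens per batch")
```

Real rollouts vary in length, so this is an upper bound, which fits the "nearly 1 billion" figure in the headline.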
napkin-math fermi estimates for large-scale RL runs:
assume comms are free, everything is fully async, and O(100)-step staleness is fine. what's left is that batch size is coupled with research iteration speed: you want p99 rollouts to finish fast enough that overnight ablations show signal.
then for prod runs, you have one clean knob: trainer GPU allocation vs wall-clock time. theoretical limit is 3:1 (trainer:inference) if inference is FLOP-bound, but more realistically 1:3 or 1:4 is the practical sweet spot.
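One way to see where a 3:1 figure could come from, reading the ratio as trainer GPUs to inference GPUs: under the usual dense-transformer approximation, generating a token costs about 2 FLOPs per active parameter and training on it costs about 6 (forward plus backward). The sketch below rests on that approximation; the function name and the MFU values are illustrative assumptions, not numbers from the thread.

```python
# Standard approximation (an assumption here): FLOPs per active parameter per token.
INFERENCE_FLOPS_PER_PARAM_TOKEN = 2  # forward pass only
TRAINING_FLOPS_PER_PARAM_TOKEN = 6   # forward + backward


def trainer_to_inference_gpu_ratio(train_mfu: float, infer_mfu: float) -> float:
    """GPUs needed for training divided by GPUs needed for generation,
    when both sides must process the same token stream at the same rate."""
    train_gpu_cost = TRAINING_FLOPS_PER_PARAM_TOKEN / train_mfu
    infer_gpu_cost = INFERENCE_FLOPS_PER_PARAM_TOKEN / infer_mfu
    return train_gpu_cost / infer_gpu_cost


# FLOP-bound inference (same utilization on both sides) gives the 3:1 limit.
print(trainer_to_inference_gpu_ratio(train_mfu=0.30, infer_mfu=0.30))

# Hypothetical bandwidth-bound decode at ~1/9 of trainer utilization lands near 1:3.
print(trainer_to_inference_gpu_ratio(train_mfu=0.36, infer_mfu=0.04))
```

The second call only shows how low effective decode utilization can flip the ratio toward inference; the specific utilization numbers are made up for the example.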
you might also assume: A5B (5B active params) ablations should show clear signal in a few hours for the env distribution; the prod run is an A50B MoE with p99 rollouts of a few hours; 100K GPUs, 100K tokens/rollout, everything FP4 w 30% MFU.
what this very roughly gets you is that a 4-week prod run might look something like 100K steps, each with 100K rollouts, optimistically. in practice there's prob 0.5-1 OOMs (orders of magnitude) of overhead which creeps in somewhere, pulling down both your batch size + step count a bit.
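A rough sanity check of that estimate, as a sketch under stated assumptions: the peak FP4 throughput of 1e16 FLOP/s per GPU is a placeholder order of magnitude (the thread does not name hardware), the 2 and 6 FLOPs per active parameter per token are the standard approximations, and the same 30% MFU is applied to generation and training, which is optimistic. All function and variable names are hypothetical.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def flops_budget(num_gpus: int, weeks: float, peak_flops_per_gpu: float, mfu: float) -> float:
    """Total useful FLOPs the cluster delivers over the run."""
    return num_gpus * weeks * SECONDS_PER_WEEK * peak_flops_per_gpu * mfu


def flops_needed(active_params: float, steps: int, rollouts_per_step: int,
                 tokens_per_rollout: int) -> float:
    """FLOPs to generate every token once (2 per param) and train on it once (6 per param)."""
    total_tokens = steps * rollouts_per_step * tokens_per_rollout
    return (2 + 6) * active_params * total_tokens


# Thread's assumptions: 100K GPUs, 4 weeks, 30% MFU, A50B model.
# ASSUMPTION: 1e16 FLOP/s peak FP4 per GPU, chosen only as a round number.
budget = flops_budget(num_gpus=100_000, weeks=4, peak_flops_per_gpu=1e16, mfu=0.30)
needed = flops_needed(active_params=50e9, steps=100_000,
                      rollouts_per_step=100_000, tokens_per_rollout=100_000)

print(f"budget: {budget:.2e} FLOPs")
print(f"needed: {needed:.2e} FLOPs")
print(f"needed / budget: {needed / budget:.2f}")
```

Under these inputs the run fits inside the budget with less than 2x headroom, so the thread's 0.5-1 orders of magnitude of overhead would have to come out of step count or batch size, as the post says.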
i think this feels directionally correct? any other fun heuristics people like using here?
