In general, RL seems to scale batch size very differently from pretraining. Even in their earlier, shorter-context stages, the batch sizes were always greater than or equal to the pretraining batch size.
Interestingly, I didn’t see anyone talking about this, but MAI used a batch size of almost 1B tokens during their final RL stage.
