1/ Now that we're running out of data, how do you optimally scale multi-epoch pretraining to hundreds of epochs?
Our first paper from Q! q0 trains a population of models instead of a single model that saturates fast, reaching a dramatically lower loss at *every* epoch budget.
w/ @bishmdl76 @akshayvegesna @ShmuelBerman
