@BlancheMinerva If you already know P and D, then do 2 (fix P/D across the smaller models).
If you are trying to decide P and D, then it's slightly more complicated.
I'm training a big model w/ P parameters and D tokens. Before I do so, I train smaller models to estimate the performance of the big model. Which of the following scaling regimes should I use to get the best predictions?
1. Fix D across all models
2. Fix P/D
3. Something else
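
For anyone who wants option 2 spelled out, here's a rough sketch of picking token budgets for the small runs so that each one keeps the big model's P/D. The function name `fixed_ratio_configs` and all the sizes in the example are made up for illustration; nothing here comes from the thread beyond the idea of holding P/D constant.

```python
def fixed_ratio_configs(target_params, target_tokens, small_param_counts):
    """Return (P, D) pairs for small runs that share the target's P/D ratio.

    Hypothetical helper: for each small parameter count P_small, the token
    budget is scaled so that P_small / D_small == target_params / target_tokens.
    """
    configs = []
    for p in small_param_counts:
        # Same P/D as the target means D_small = P_small * (D / P).
        d = round(p * target_tokens / target_params)
        configs.append((p, d))
    return configs


if __name__ == "__main__":
    # Illustrative numbers only (assumed, not from the thread).
    big_p = 70_000_000_000        # parameters in the big model
    big_d = 1_400_000_000_000     # training tokens for the big model
    for p, d in fixed_ratio_configs(big_p, big_d, [100_000_000, 1_000_000_000]):
        print(f"P={p:,} D={d:,} P/D={p / d:.3f}")
```

Option 1 would instead give every small run the full D, so the small models would sit at a very different P/D than the model you're trying to predict.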