Researchers Boris Hanin, William Barr Held, and Percy Liang validate a scaling law predicting pre-training loss for a 129-billion-parameter MoE model.
The final loss of 2.234 closely matched their prediction of 2.252.
Saw this cool post by Percy: https://x.com/percyliang/status/2058621601542009341 and it reminded me of the QT'd paper.
Question: what have we learnt about how to interpret pretraining loss over the past two years? Any good papers I should add to the never-ending list?
Remarkable results! So exciting.
Congrats @BorisHanin and @WilliamBarrHeld, @percyliang
@DARPA AIQ program!
Incredible predictability for pre-training loss across a more than 100x scale-up of compute. Big congrats to @WilliamBarrHeld and @percyliang. HP transfer / parameterization based in part on our work with @CPehlevan, @blake__bordelon, and Tianze Jiang. Part of @DARPA AIQ, run by @patrickshafto.
@sebkrier I loved Allen-Zhu's blitz around the same time https://arxiv.org/pdf/2404.05405
