1d ago

Researchers Boris Hanin, William Barr Held, and Percy Liang validate a scaling law predicting pre-training loss for a 129-billion-parameter MoE model

The final loss of 2.234 closely matched their prediction of 2.252, a gap of 0.018 (about 0.8%).

Original post

Incredible predictability for pre-training loss across a more than 100x scaling up of compute. Big congrats to @WilliamBarrHeld and @percyliang. HP transfer / parameterization based in part on our work with @CPehlevan, @blake__bordelon, and Tianze Jiang. Part of @DARPA AIQ, run by @patrickshafto.

11:28 AM · May 28, 2026

Saw this cool post by Percy: https://x.com/percyliang/status/2058621601542009341 and it reminded me of the QT'd paper.

Question: what have we learnt about how to interpret pretraining loss over the past two years? Any good papers I should add to the never-ending list?

8:29 PM · May 28, 2026 · 2.6K Views

Remarkable results! So exciting.

Congrats @BorisHanin, @WilliamBarrHeld, and @percyliang

@DARPA AIQ program!

Boris Hanin @BorisHanin

Incredible predictability for pre-training loss across a more than 100x scaling up of compute. Big congrats to @WilliamBarrHeld and @percyliang. HP transfer / parameterization based in part on our work with @CPehlevan, @blake__bordelon, and Tianze Jiang. Part of @DARPA AIQ, run by @patrickshafto.

6:28 PM · May 28, 2026 · 8.4K Views
6:45 PM · May 28, 2026 · 757 Views

@sebkrier I loved Allen-Zhu's blitz around the same time https://arxiv.org/pdf/2404.05405

Séb Krier @sebkrier

Saw this cool post by Percy: https://x.com/percyliang/status/2058621601542009341 and it reminded me of the QT'd paper. Question: what have we learnt about how to interpret pretraining loss over the past two years? Any good papers I should add to the never-ending list?

8:29 PM · May 28, 2026 · 2.6K Views
8:43 PM · May 28, 2026 · 299 Views