@natashajaques 😏
Really enjoyed reading the Microsoft MAI-Thinking-1 "Building a Hill Climbing Machine" paper. Amazing they publicly released all the info needed to train a frontier model, down to hparams.
I also thought this was pretty telling:
- pre-training: 30 trillion tokens
- mid-training (SFT on STEM/math/code data): 3.55 trillion tokens
- RL post-training: 150 billion tokens

Looks like @ylecun was right all along with the cake analogy.
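To put the three token counts side by side, here's a quick back-of-the-envelope sketch. It uses only the figures quoted above; the names `TOKENS` and `token_shares` are made up for illustration, and treating tokens from the three phases as directly comparable is itself an assumption (RL tokens are not the same kind of signal as pre-training tokens).

```python
# Token counts as quoted in the post above (not independently verified).
TOKENS = {
    "pre-training": 30e12,       # 30 trillion
    "mid-training": 3.55e12,     # 3.55 trillion
    "RL post-training": 150e9,   # 150 billion
}


def token_shares(tokens):
    """Return each phase's fraction of the total token count."""
    total = sum(tokens.values())
    return {phase: count / total for phase, count in tokens.items()}


shares = token_shares(TOKENS)
for phase, share in shares.items():
    print(f"{phase}: {share:.2%}")

# How many pre-training tokens there are per RL token.
pretrain_to_rl = TOKENS["pre-training"] / TOKENS["RL post-training"]
print(f"pre-training / RL ratio: {pretrain_to_rl:.0f}x")
```

By this count, pre-training is roughly 89% of all tokens, mid-training about 10.5%, and RL under half a percent, with about 200 pre-training tokens for every RL token.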
Obviously I still think something like RL (optimizing for long-term goals) is fundamental to what we think of as intelligence. But what matters isn't the volume of learning signal; it's the optimization on top of an already reasonable predictive model.


