to be fair: pretraining loss scales logarithmically with training compute, but capabilities aren't linear in pretraining loss. each marginal reduction in loss gets harder to achieve and potentially unlocks more capabilities. dario's claim is still insane though
Good callout. *Everyone* knows capabilities scale logarithmically with computing power.
This is the second paragraph. How on earth does a mistake like this make it in?
