LLM community slowly rediscovering what we in vision found out over half a decade ago. MY SCHMIDHUBER MOMENT IS COMING!
Source: the S4L paper, where I tuned the most sota 10%-label and 1%-label ImageNet baselines ever, by far. https://arxiv.org/abs/1905.03670
For people wondering how frontier labs can scale to hundreds of trillions of tokens: just crank weight decay ALL THE WAY UP and keep grinding on the same dataset, silly! Lots of other details on distillation, ensembling, and synthetic data too. No, tokens won't be a wall.
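If you want to see the "many epochs on the same data plus heavy weight decay" knob in isolation, here is a toy sketch. It is not the S4L recipe or anything a frontier lab runs: the 1-D linear model, the `train` helper, the `DATA` list, and every hyperparameter value are made up for illustration. It only shows that decoupled weight decay shrinks the weights after each step while the same tiny dataset is replayed epoch after epoch.

```python
import random

# Hypothetical toy dataset: y = 3 * x, replayed for many epochs.
DATA = [(0.5, 1.5), (1.0, 3.0), (1.5, 4.5), (2.0, 6.0)]


def train(data, epochs, lr, weight_decay, seed=0):
    """Multi-epoch SGD on a 1-D model y = w * x with decoupled weight decay."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)  # same examples every epoch, new order
        for x, y in order:
            grad = 2.0 * (w * x - y) * x  # gradient of the squared error
            w -= lr * grad                # plain SGD step
            w -= lr * weight_decay * w    # decoupled decay, applied after the step
    return w


if __name__ == "__main__":
    for wd in (0.0, 0.5, 5.0):
        w = train(DATA, epochs=20, lr=0.05, weight_decay=wd)
        print(f"weight_decay={wd}: w={w:.4f}")
```

With `weight_decay=0.0` the model just fits the data (w ends up near 3). Turning the decay up pulls w toward zero, which is the regularization pressure that makes repeated passes over a fixed dataset less prone to memorization in real models.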












