Hi @JitendraMalikCV Nice to hear from you. The original paper is from 2018. @notmisha was ahead of the game.
My team was always interested in sequence prediction as a path to intelligence. In 2010, we developed spatio-temporal convnets for video and wrote: "future improvements in image representations will, instead, be driven by efficient and robust algorithms that learn to extract hierarchical, distributed feature representations from images and video in a fully unsupervised manner." https://www.cs.ubc.ca/~nando/papers/nipsworkshop2010.pdf - I presented this work at NeurIPS and I still remember you in the audience stating that we needed more serious data. You were right. We needed the GPUs too. Soon enough we were doing GPU summer schools with @CIFAR_News at @UofTCompSci, and students in the audience figured out how to code convnets on GPUs so they could run them on bigger datasets like ImageNet.
Back then, we had already tried multimodal language models in https://arxiv.org/pdf/1108.3298, where we wrote: "Detecting temporal patterns and predicting into the future is a fundamental problem in machine learning. It has gained great interest recently in the areas of nonparametric Bayesian statistics (Wood et al., 2009) and deep learning (Sutskever et al., 2011), with applications to several domains including language modeling and unsupervised learning of audio and video sequences. Some researchers have argued that sequence prediction is key to understanding human intelligence (Hawkins and Blakeslee, 2005). The close connections between sequence prediction and data compression are perhaps under appreciated within the machine learning community."
Later, at @GoogleDeepMind, @scott_e_reed proposed Few-shot Autoregressive Density Estimation https://arxiv.org/pdf/1710.10304, which @OpenAI executed much more effectively later when they wrote the GPT-3 paper.
@NandoDF Nando, here is a paper from NeurIPS 2024: https://humanoid-next-token-prediction.github.io/