Microsoft’s Nando de Freitas and Jitendra Malik debate sequence prediction’s evolution from historical AI theory to humanoid robotics


Malik traced the theory to Claude Shannon in 1951.

Original post

Nando de Freitas @NandoDF

Hi @JitendraMalikCV Nice to hear from you. The original paper is from 2018. @notmisha was ahead of the game.

My team was always interested in sequence prediction as a path to intelligence. In 2010, we developed spatio-temporal convnets for video and wrote: “future improvements in image representations will, instead, be driven by efficient and robust algorithms that learn to extract hierarchical, distributed feature representations from images and video in a fully unsupervised manner.” https://www.cs.ubc.ca/~nando/papers/nipsworkshop2010.pdf I presented this work at NeurIPS, and I still remember you in the audience stating that we needed more serious data. You were right. We needed the GPUs too. Soon enough we were doing GPU summer schools with @CIFAR_News in @UofTCompSci, and students in the audience figured out how to code convnets on GPUs so they could run them on bigger datasets like ImageNet.

Back then, we had already tried multimodal language models in https://arxiv.org/pdf/1108.3298, where we wrote: “Detecting temporal patterns and predicting into the future is a fundamental problem in machine learning. It has gained great interest recently in the areas of nonparametric Bayesian statistics (Wood et al., 2009) and deep learning (Sutskever et al., 2011), with applications to several domains including language modeling and unsupervised learning of audio and video sequences. Some researchers have argued that sequence prediction is key to understanding human intelligence (Hawkins and Blakeslee, 2005). The close connections between sequence prediction and data compression are perhaps under appreciated within the machine learning community.”

Later at @GoogleDeepMind, @scott_e_reed proposed Few-shot Autoregressive Density Estimation https://arxiv.org/pdf/1710.10304, which @OpenAI executed much more effectively later when they wrote the GPT-3 paper.

Jitendra MALIK @JitendraMalikCV

@NandoDF Nando, Here is a paper from NeurIPS 2024 https://humanoid-next-token-prediction.github.io/

1:03 PM · Jun 27, 2026 · 307 Views

Sentiment

Users express excitement about Nando de Freitas tracing sequence prediction roots to 2010 research because it looks interesting and relevant.

Positive: 100.0%, Negative: 0.0% (1 comment with sentiment).


Posts from X


Jitendra MALIK @JitendraMalikCV

@NandoDF @notmisha Hi @NandoDF Totally concur that sequence prediction is old. We could cite Shannon (1951) for instance. The Radosavovic et al (2024) paper develops these ideas for robotics and has experimental results for humanoid locomotion. @ir413


Nando de Freitas @NandoDF

@JitendraMalikCV @notmisha @ir413 Looking forward to reading it. It looks really interesting and relevant. You may find Misha’s development of the idea for humanoid robot hands here: https://arxiv.org/abs/1804.06318


Nando de Freitas @NandoDF

This is a nice paper, well executed! @scott_e_reed had this in mind when developing Gato https://arxiv.org/abs/2205.06175 — I’m glad to see the idea executed with a humanoid and I’d love to see more work along this direction. Gato stood for General AgenT One. Sadly, we weren’t able to develop General Agent Two.

Future work should also focus on more modalities — touch, proprioception and vestibular information. This is essential for real control and interaction with objects. Walking involves contact with the ground but physical manipulation of different substances, materials and objects is more complex. For this, I agree with you we need to move beyond VLAs. One hardly needs vision for manipulation.
