AI · 18h ago

Computer vision expert Jitendra Malik argues robotics researchers should prioritize physical sensorimotor manipulation over vision-language-action models

DeepMind's Shane Gu endorsed the advice, citing assembly research.

Original post
Jitendra MALIK @JitendraMalikCV · #857 in AI

I want to offer some unsolicited advice to computer vision researchers jumping into robotics. Don't focus too much on VLMs, VLAs etc. That's fine, but the real action is at the sensorimotor level. Most of the open problems in robotics are in manipulation, which is about hand-object interaction, and contacts and forces are central. Proprioception and tactile sensing are as important as vision. Don't get seduced by cherry-picked demos. You can't do robotics without doing robotics.

2:01 PM · Jun 5, 2026 · 122.7K Views
Sentiment

Users strongly endorse robotics experts' call to prioritize sensorimotor manipulation and physical interaction skills over VLMs, viewing the latter as too slow, expensive, and disconnected from real-world contacts, forces, and deployment.

Positive: 100.0% · Negative: 0.0% (11 comments with sentiment)
Cluster Engagement
Posts from X
Most Activity
Views 33.6K · Bookmarks 155 · Likes 191 · Replies 5
Shane Gu @shaneguML

Appreciate Jitendra's takes on world models/VLMs. His word below is why back in 2019-2021, instead of VLAs for simple pick-and-place, we chose assembly.

Dexterity = mutual info between your intent and forces/torques on objects via contacts.

Quoting Jitendra MALIK @JitendraMalikCV (original post above)

14h · Views 33.6K · Likes 191 · Bookmarks 155
RETWEETS 151
Jitendra MALIK @JitendraMalikCV (original post, quoted in full above)

18h · Views 122.7K · Likes 1.7K · Bookmarks 662
Chris Paxton @chris_j_paxton

There are two core "software" problems worth solving in robotics in my opinion:
- end-to-end learning of dexterous manipulation skills
- dynamic, long-horizon spatial memory which can interact with the above

As a field we're currently very focused on the first because, well, it's what works best with current techniques and it produces flashier demos. And for assembly lines it's really the one you need.

But the real long tail of robotics work will actually require both, and I really know very few teams that have strong expertise in both things.

My impression is that the SECOND problem is the harder one, because if it was easy we'd have useful AR glasses on the market

Quoting Jitendra MALIK @JitendraMalikCV (original post above)

1h · Views 2K · Likes 15 · Bookmarks 10
Shane Gu @shaneguML

w/ Satoshi, @IMordatch, @coolboi95 (founding engineer at Generalist), Michael, Luke Metz, Dan Freeman, originally w/ @Vikashplus. great memories from back in research days at Google Brain Robotics.

Real robot link: https://sites.google.com/view/u-shape-block-assembly
Sim link: https://sites.google.com/view/learning-direct-assembly

Quoting Shane Gu @shaneguML (post above)

14h · Views 1.5K · Likes 5 · Bookmarks 4
Wenhu Chen @WenhuChen

Such good advice!

Quoting Jitendra MALIK @JitendraMalikCV (original post above)

8h · Views 3.8K · Likes 11 · Bookmarks 3
Asuka Zheng🎀 @VoidAsuka

I agree. Let's first figure out what necessary modalities are required to make the policy work, focusing on improving the sensors first so we can obtain high-quality data.

By then, LLMs will be strong enough (or are already strong enough) to figure out the best model architecture for a robotics foundation model recursively.

From first principles, robotics data shouldn't consist solely of vision-action data; it isn't like autonomous driving. It only looks that way right now because the field is currently dominated by computer vision and autonomous driving researchers.

15h · Views 2.8K · Likes 11 · Bookmarks 1
Jay@Proception @JayLiStanford

@JitendraMalikCV 💯💯 "Most of the open problems in robotics are in manipulation, which is about hand-object interaction, and contacts and forces are central. Proprioception and tactile sensing are as important as vision."

17h · Views 935 · Likes 3 · Bookmarks 1

@JitendraMalikCV @jon_barron 💯

Quoting Jitendra MALIK @JitendraMalikCV (original post above)

7h · Views 2.5K · Likes 2 · Bookmarks 1
Kekko D’Amato @kekkodamato_

Strong take. VLMs get all the attention because they're easy to benchmark, but the sensorimotor gap is where robotics actually breaks down. Understanding a scene is unsolved; reliably *touching* the right part of it with the right force at the right moment is a different problem entirely.

13h · Views 183 · Bookmarks 1
Yas @c_yass_ab

@JitendraMalikCV @CSProfKGD VLMs are computationally expensive and slow for action recognition, yet I don’t know why the vision community is obsessed with them rather than improving the vision models.

6h · Views 671 · Likes 2

@JitendraMalikCV Well, VLAs are actually working quite well in terms of generalization, most of them are also integrating flow matching/diffusion for trajectory/controls as well, not to mention the amount of work going on in representing proprioceptive states and other sensing modalities.

15h · Views 1.5K · Likes 1
GeekPark @GeekParkHQ

@JitendraMalikCV i've seen so many VLA demos that look incredible and then ship absolutely nothing. meanwhile Figure at BMW: 90k parts, 30k cars, 99%+ placement. that's years of boring sensorimotor data and brutal evals. the unsexy stuff is the stuff that works, every time, it's almost annoying

11h · Views 1.2K · Likes 1

@JitendraMalikCV Computer vision already is a flawed terminology. It should be called computer perception.

16h · Views 1.2K · Likes 1
Delta, Dirac @DeltaClimbs

@JitendraMalikCV @et_tu_deux Does extreme robot arm lightweighting help with this at all?

16h · Views 1.5K
andrewyu @andrewcyu

@JitendraMalikCV @jon_barron very helpful thank you. curious how you think about interoception in relation to tactile sensing and proprioception?

16h · Views 1.3K
Awais @iAwaisRauf

@JitendraMalikCV What textbook would you recommend for diving deeper into the sensorimotor side of robotics?

16h · Views 1.2K
Abhinav Palle @TetraxYT

@JitendraMalikCV You need all of them. Vision included.

16h · Views 1.1K
AIMathematician @CustomAIMath

@JitendraMalikCV ive simplified the 40 steps PDE cluster fk of a system

and worst i dont do robotics lol

13h · Views 324 · Likes 1
Cooper @Jake32651324

@JitendraMalikCV I think there are more AMRs out there in comparison to manipulators. Real action might be there.

16h · Views 962
Andy Wojcicki @pretendsmarts

@JitendraMalikCV without a language model in the loop, how will the robot know to say "oops" when it drops something?

9h · Views 302 · Likes 1