Tech · 4d ago

Computer vision expert Jitendra Malik advises computer vision researchers moving into robotics to prioritize sensorimotor manipulation over vision-language and vision-language-action models

DeepMind's Shane Gu endorsed the advice, citing assembly research.

139 · 3.1K · 392 · 1.3K · 466.8K
Original post
Jitendra MALIK @JitendraMalikCV · #922 in Tech

I want to offer some unsolicited advice to computer vision researchers jumping into robotics. Don't focus too much on VLMs, VLAs etc. That's fine, but the real action is at the sensorimotor level. Most of the open problems in robotics are in manipulation, which is about hand-object interaction, and contacts and forces are central. Proprioception and tactile sensing are as important as vision. Don't get seduced by cherry-picked demos. You can't do robotics without doing robotics.

2:01 PM · Jun 5, 2026 · 466.8K Views
Sentiment

Most commenters endorse Jitendra Malik's call to prioritize sensorimotor manipulation over VLMs and VLAs in robotics, arguing that flashy demos rarely translate into practical results and that the hard open problems involve contact and force. Some, such as Chris Paxton, add that long-horizon spatial memory is an equally important problem.

Positive: 88.9% · Negative: 11.1% · 20 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
VIEWS 47.9K

It’s looking like time for Jitendra to change his twitter handle!

3d · Views 47.9K · Likes 152 · Bookmarks 32
BOOKMARKS 185
Shane Gu @shaneguML

Appreciate Jitendra's takes on world models/VLMs. His word below is why back in 2019-2021, instead of VLAs for simple pick-and-place, we chose assembly.

Dexterity = mutual info between your intent and forces/torques on objects via contacts.

4d · Views 39.8K · Likes 220 · Bookmarks 185
LIKES 254
Yann LeCun @ylecun

@JitendraMalikCV Exactly. Also, you can't act without the ability to predict the consequences of your actions. Also, VLAs are dead.

4d · Views 18.6K · Likes 254 · Bookmarks 48
RETWEETS 388
Jitendra MALIK @JitendraMalikCV

4d · Views 466.8K · Likes 3.1K · Bookmarks 1.3K
REPLIES 12
Chris Paxton @chris_j_paxton

There are two core "software" problems worth solving in robotics in my opinion:
- end-to-end learning of dexterous manipulation skills
- dynamic, long-horizon spatial memory which can interact with the above

As a field we're currently very focused on the first because, well, it's what works best with current techniques and it produces flashier demos. And for assembly lines it's really the one you need.

But the real long tail of robotics work will actually require both, and I really know very few teams that have strong expertise in both things.

My impression is that the SECOND problem is the harder one, because if it was easy we'd have useful AR glasses on the market

4d · Views 16.4K · Likes 116 · Bookmarks 71
Shane Gu @shaneguML

w/ Satoshi, @IMordatch, @coolboi95 (founding engineer at Generalist), Michael, Luke Metz, Dan Freeman, originally w/ @Vikashplus. Great memories from back in research days at Google Brain Robotics.

Real robot link: https://sites.google.com/view/u-shape-block-assembly
Sim link: https://sites.google.com/view/learning-direct-assembly

4d · Views 3.3K · Likes 12 · Bookmarks 12
Wenhu Chen @WenhuChen

Such good advice!

4d · Views 6.2K · Likes 15 · Bookmarks 4

@JitendraMalikCV @jon_barron 💯

4d · Views 4.3K · Likes 4 · Bookmarks 2
Asuka Zheng🎀 @VoidAsuka

I agree. Let's first figure out what necessary modalities are required to make the policy work, focusing on improving the sensors first so we can obtain high-quality data.

By then, LLMs will be strong enough (or are already strong enough) to figure out the best model architecture for a robotics foundation model recursively.

From first principles, robotics data shouldn't consist solely of vision-action data; it isn't like autonomous driving. It only looks that way right now because the field is currently dominated by computer vision and autonomous driving researchers.

4d · Views 2.8K · Likes 11 · Bookmarks 1
Jitendra MALIK @JitendraMalikCV

@chrmanning Haha!

It’s looking like time for Jitendra to change his twitter handle!

3d · Views 2.2K · Likes 12 · Bookmarks 0
Jay@Proception @JayLiStanford

@JitendraMalikCV 💯💯 "Most of the open problems in robotics are in manipulation, which is about hand-object interaction, and contacts and forces are central. Proprioception and tactile sensing are as important as vision."

4d · Views 935 · Likes 3 · Bookmarks 1
Michelle @michellelsun

The teams that are making the most progress are logging contact hours in messy, real-world environments, not cutting demo reels. And the gap is widening.

Tactile sensing and proprioception are also where the hardware iteration shows up. Shenzhen vendors turn actuator and tactile-sensor revisions fast, but hand-object contact data stays the bottleneck. The demo was always the easy part.

2d · Views 545 · Bookmarks 1
Kekko D’Amato @kekkodamato_

Strong take. VLMs get all the attention because they're easy to benchmark, but the sensorimotor gap is where robotics actually breaks down. Understanding a scene is unsolved; reliably *touching* the right part of it with the right force at the right moment is a different problem entirely.

4d · Views 183 · Bookmarks 1

@chris_j_paxton The second problem is mostly harder due to the lack of benchmarks.

4d · Views 1.4K · Likes 1

@chris_j_paxton Oh, yes, yes. I know, that's the main reason why there is so little progress.

4d · Views 124 · Likes 2
Uğur Yekta Başak @uguryektabasak

@chris_j_paxton It looks like Niantic is in a good position to solve spatial memory (they previously called it AR cloud, so you are right about your AR glasses analogy). @asimahmed has shared great demonstrations recently.

3d · Views 150 · Likes 1
Rayhaan_solo @Rayhaan87049059

@JitendraMalikCV I agree completely. When I started my master's, I was optimistic that VLAs would solve my problem, particularly for humanoids playing chess.

3d · Views 326
Chris Paxton @chris_j_paxton

@sentientcar I think context length will be part of the solution for sure but there's fundamentally a data problem here too

4d · Views 70 · Likes 1
Chris Paxton @chris_j_paxton

@roeiherzig People have tried! I worked on this once upon a time. It's much harder than you think.

4d · Views 156
Paul Woodward @paul_v_woodward

@chris_j_paxton Agree 100% been working on memory for a while now @dwell_bot

4d · Views 41 · Likes 1