TU Darmstadt's Georgia Chalvatzaki argues physical contact makes robot action spaces complex, while Jon Barron defends end-to-end image-to-action learning

Yu Xiang@YuXiang_IRVL

Have to disagree that robotics should not care about 3D.

Robots operate in a 3D world. Hand-eye calibration, 3D perception, grasp planning, motion planning, and contact-rich manipulation all rely on 3D geometry.

3D remains fundamental to robotics.

Jon Barron@jon_barron

I'll be giving an "exactly how bitter lesson'ed is all of 3D computer vision?" talk at the Bitter Lessons CVPR workshop tomorrow at 10:30am, Room 3A-3D. Here's a slide with the overall thesis.

Kostas Daniilidis@KostasPenn

@GeorgiaChal @jon_barron @vincesitzmann @anand_bhattad #bitterlesson The low dimensionality of the action space does not make the problem easy. What matters is the structure of the latent manifold and what it encodes. 3D representations are efficient because they are invariant to nuisances like viewpoint, illumination, and appearance in pixel space. With infinite pixel2action episodes from behavior cloning or RL, these representations may eventually emerge the way they emerged during evolution (probe animal brains and you will see equivariant 3D representations; see work by Logothetis, Poggio, and others). If the task were as simple as @jon_barron claims, no 3D or flow or stereo would have emerged in biological brains. Until we operate with evolution-like, GPT-like data, hard-coding geometric structure is the best inductive bias we can apply. Evolution converged to geometry.

Georgia Chalvatzaki@GeorgiaChal

@jon_barron You’ve put your finger on the real question there… where structure enters, modular vs end-to-end... That’s exactly what we’re debating, so I’ll save the long response for the panel. 🙂

Jon Barron@jon_barron

I don't think we disagree at all, except for the "we should hard-code geometric structure" bit. As you said, 3D representations almost certainly emerge in large models trained on actions or pixels. But the slide asks if *you* (the AI researcher) should care about 3D --- the network definitely needs to care, but these are very different questions.

Jiatao Gu@thoma_gu

Agree with @KostasPenn. In fact, 3D representations give us dense, high-dimensional targets with the right geometric inductive bias — possibly better scaling targets than low-dimensional actions alone.👀

Jon Barron@jon_barron

@GeorgiaChal oh cool, I'll try to make it!

Just to clarify, I don't think robotics is easy in absolute terms. What I'm asserting here is that the problem of predicting actions from images is easier than the problem of predicting 3D from images and then predicting actions from 3D.

Georgia Chalvatzaki@GeorgiaChal

"The space of actions is tiny" — that's what he said... A 7-DoF command is low-dimensional, but the problem isn't. Contact makes the dynamics hybrid and non-smooth; you're optimizing across manifold switches every time something touches. That hardness is the geometry being waved away.

Conveniently, the field is debating exactly this on Friday! Come weigh in: "Geometry in the Age of Data-Driven Robotics," #ICRA2026, Hall C4, Fri Jun 5. https://geometric-robotics.github.io/icra-2026-workshop/

Georgia Chalvatzaki@GeorgiaChal

You picked the one domain where you're right. Self-driving is contactless SE(2) motion, so sure, ~2D. Add contacts, e.g., a hand, a foot, and then start multiplying by 2, and the feasible set becomes hybrid and non-smooth, and sequencing smooth control tuples no longer covers it. That's the misconception: "small" conflates input dimension with the geometry of the problem. Please do not overgeneralize on the broad field of robotics problems...

Jack Langerman ✈️ CVPR@jacklangerman

@YuXiang_IRVL do you think *explicit* 3D is required or just "3D understanding"?

Jon Barron@jon_barron

@GeorgiaChal I don't understand, how is it not small? The action space of self-driving at any moment in time is 2-dimensional (speed and theta). The only way I can see to make it bigger is to try to predict a plan (across time) made out of (speed, theta) tuples, is that what you mean?

Georgia Chalvatzaki@GeorgiaChal

The action space in robotics isn't small! That's a misconception many people hold. Contact makes the dynamics hybrid: you switch manifolds every time a contact makes or breaks, and the optimization landscape is non-smooth precisely there. A low-dimensional command vector is not a small problem. That difficulty is the SE(3) geometry you're waving away!

Yu Xiang@YuXiang_IRVL

@jacklangerman Good question. I would say 3D understanding; it doesn't need to be explicit 3D.

Georgia Chalvatzaki@GeorgiaChal

@jon_barron @alexandertmai @KostasPenn @vincesitzmann @anand_bhattad We should really have a cross-disciplinary in-person chat about this topic. It's not possible to cover all of robotics and learning via X posts :)

Jon Barron@jon_barron

@alexandertmai @KostasPenn @GeorgiaChal @vincesitzmann @anand_bhattad Yeah if you decide your output space is a plan of actions, then the story changes a lot. Do people do this in practice? Why? Seems easier to just decide what to do now, and then decide what to do at the next time step then. LLMs don't plan, they just roll out.

Chris Liu@drxcliu

@YuXiang_IRVL @bowenwen_me Agreed. Humans already embed 3D perception in vision, so we actually move according to the space around us rather than the image frames from our two eyes.

Will Hughes@woodtechwill

@YuXiang_IRVL I build stuff in CAD for fun. Can't imagine automating grasping without 3D. The geometry is the whole point.

Georgia Chalvatzaki@GeorgiaChal

@KostasPenn @jon_barron @vincesitzmann @anand_bhattad Well said! Totally agree!

Chris Paxton@chris_j_paxton

@GeorgiaChal @jon_barron And Waymo does use 3D

Roei Herzig ✈️ CVPR@roeiherzig

@jon_barron @GeorgiaChal I'm not sure whether action prediction is fundamentally different from 3D hand pose estimation. They could certainly be represented in the same space, and that's probably the best way to leverage human data with robots.
