Chris Paxton, Agility Robotics AI lead, outlines seven robotics data collection methods while engineer kache argues scaling relies on simulation
Story Overview
Frontier robotics models cannot rely on passive web scraping the way language models did. Every trajectory, torque reading, and tactile signal must be captured through actual hardware interaction, turning data collection into the central scaling constraint and forcing teams to weigh cost, fidelity, and embodiment gaps across multiple deliberate methods.
Blending heterogeneous sources still lacks proven recipes
Even once the data has been collected, stitching teleoperation logs, fleet runs, simulation rollouts, and video sources into one coherent training signal remains the harder, still-unsolved step, and no public benchmarks yet show how to balance quality against scale.
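As a rough illustration of what "blending" can mean in practice, one simple pattern is to draw each training example from a source chosen at random according to fixed mixture weights. The sketch below is only that, a sketch: the source names follow the summary above, but the weights, the `sample_sources` helper, and the batch size are hypothetical and are not taken from Paxton's post or from any published recipe.

```python
import random
from collections import Counter

# Hypothetical mixture weights, chosen only for illustration.
# The summary above notes that no proven recipe for these values exists.
MIXTURE = {
    "teleoperation": 0.4,
    "fleet": 0.2,
    "simulation": 0.3,
    "video": 0.1,
}


def sample_sources(mixture, batch_size, seed=0):
    """Pick a data source for each slot in a batch, proportional to its weight."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=batch_size)


batch = sample_sources(MIXTURE, batch_size=1000)
counts = Counter(batch)  # how many slots each source received
print(counts)
```

Raising the weight on a source trades its fidelity and cost against the others; choosing those weights well is exactly the open question the summary describes.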
Deployment scale alone won't unlock the next leap
Fleet data only becomes useful once robots are already operating at volume, creating a chicken-and-egg limit that pure simulation or video approaches have not yet cleared for contact-rich tasks.
Some users endorse simulation and hybrid methods for the robotics data bottleneck because they generate abundant training data, while others argue hardware is the real problem and dismiss simulation-heavy approaches.
Most Activity
wrong. it's simulators

@yacineMTB Closing the sim-to-real gap?

@chris_j_paxton Would Serve Robotics' data collection apply here? I believe they are already selling data, if I'm not mistaken.

@yacineMTB I agree that it's just simulators, but at the same time, why don't humans, or bugs for that matter, need simulators? Are the weights already in the genes? They would have to be, wouldn't they. This would mean evolution is a sim.

@yacineMTB I fully endorse this view. The robot needs a "mind's eye" and that doesn't need to be terribly "high-token-resolution." A "simple abstraction" layer seems to be missing.

@yacineMTB Simulation creates data at millions of SPS. It creates such a firehose of information that smaller and smaller models can not only learn to do it, they can do it better, because inference is 10X faster than with larger models.

@yacineMTB RL directly on hardware, skip the sim https://arxiv.org/abs/2206.14176

@yacineMTB *Accurate simulators.

@yacineMTB @ChidubemNdukwe are you sober bro 😭?

@yacineMTB @ChidubemNdukwe 100%, there are limited ways of moving through 3d space, which you can actually simulate

@yacineMTB Aren't they good enough? I mean, look at IsaacSim

@yacineMTB IRL is the ultimate simulator.

@yacineMTB Sim is always a low fidelity model of the world and therefore retarded by definition.
Large scale deployed systems mostly rely on real data collection flywheels.

@Anteejay @yacineMTB hi

@yacineMTB always has been
that's why Dr Jim Fan at vidya is gonna win

@ChidubemNdukwe sim2real is trivial

@chris_j_paxton I was always wondering why there was so little focus on hybrid approaches. I'm confident the best outcome is finding the right mix of the different types of data.

@yacineMTB Hmm... what's your threshold for trivial? are you saying the gap is closed for specific task classes, or broadly?

@yacineMTB simulator + architecture (+ training framework) the "we need more data" gang of robotics is so retarded they just look at the gpt story and slap it onto robotics without a hint of thoughts