When @olivercameron became the first to teach us about world models (he just raised $300 million in investment last week), I was thinking: "If I drop a cup with some water in it on the ground and film that with a high-speed camera, the world model would learn a lot about how the world works."
All sorts of physics is in one video.
Now what if you have millions of videos?
It learns everything about the world.
What happens when they go real time?
They learn about the world faster.
Like how fresh the strawberries are at the local farmer's market.
It might take a few years to get the humanoid robot generalized with one of these real time world models, but when they arrive they will be very smart about the world.
Like your robot would know everything happening in every Las Vegas nightclub. Right now.
The next 18 months in World Models will be highly entertaining.
In early July I'm attending @aclmeeting, which is where the world's top language experts gather, the ones who build the AIs we mostly use today (LLMs).
Will be interesting to hear what they think of World Models.
My insight is that the smart ones are building hybrid systems that use both LLMs and World Models together.
When an Optimus or a Figure or a Neo walks into your life (and it will happen sometime in the next decade, even if I'm wrong about the rest), it will have both. LLMs for talking with you. World Models to do what you told it to do.
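Here is a toy sketch of that split, just to make the idea concrete. Everything in it is an assumption for illustration: `language_model_parse` is a lookup table standing in for an LLM, `world_model_predict` is a one-line stand-in for a learned world model, and the "world" is a number line. No real robot stack is this simple.

```python
def language_model_parse(instruction):
    # Hypothetical stand-in for an LLM: turn a spoken request into a goal
    # position on a 1-D line. A real system would call a language model here.
    goals = {"go to the kitchen": 5, "go to the door": 0}
    return goals[instruction.lower().strip()]


def world_model_predict(position, action):
    # Hypothetical stand-in for a learned world model: predict where the
    # robot ends up if it takes `action` (-1, 0, or +1) from `position`.
    return position + action


def plan(position, goal, max_steps=20):
    # Greedy one-step lookahead: imagine each action with the world model
    # and take the one whose predicted outcome lands closest to the goal.
    actions = []
    for _ in range(max_steps):
        if position == goal:
            break
        best = min(
            (-1, 0, 1),
            key=lambda a: abs(world_model_predict(position, a) - goal),
        )
        actions.append(best)
        position = world_model_predict(position, best)
    return actions


def robot(instruction, position):
    # The LLM side understands the request; the world-model side acts on it.
    goal = language_model_parse(instruction)
    return plan(position, goal)
```

Calling `robot("Go to the kitchen", 2)` returns the list of steps the planner imagined its way through. The point of the design is only that the talking part and the doing part are separate pieces that hand a goal across.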
Your robot is living in a weird digital simulation of the real world that's laid on top of the real world. Including over your face and hands.
Will do my best to share what I learn by going, though I know very little about AI compared to them.
“Rest in Peace, VLAs,” NVIDIA’s robotics lead @DrJimFan said at Sequoia’s AI Ascent 2026 conference. So, what’s next?
Here’s Jim Fan’s core argument:
VLA (Vision Language Action Model) architectures are fundamentally brittle; they merely bolted robotic actions onto LLMs.
Instead, the industry is converging on physics-grounded World Models.
When it comes to robotics data, sample efficiency and data architecture are replacing brute-force token volume.
Look at how the unit economics of data collection just shifted through two recent breakthroughs:
- @1x_tech trained its NEO humanoid world model to execute out-of-distribution tasks using just 900 hours of egocentric human video and a mere 70 hours of real robot data (Jan 2026)
- @nvidia shipped Cosmos 3, demonstrating that with a strong world foundation model, just 100 teleop seed samples are enough to post-train a complete, action-conditioned forward dynamics pipeline (Jun 2026)
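
For readers new to the term, "action-conditioned forward dynamics" just means predicting the next state from the current state plus the action taken. Below is a deliberately tiny sketch of that idea, not how 1X or NVIDIA do it: the physics in `simulate_step`, the 1-D state, and the linear model are all assumptions made up for illustration. It fits the model from 100 (state, action, next state) samples, echoing the seed-sample count above, using plain least squares.

```python
import random


def simulate_step(state, action):
    # Hypothetical "true" physics for the toy: a damped 1-D system
    # that gets pushed by the action.
    return 0.9 * state + 0.5 * action


def collect_seed_samples(n=100, seed=0):
    # Stand-in for teleop seed data: (state, action, next_state) triples.
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        s = rng.uniform(-1.0, 1.0)
        a = rng.uniform(-1.0, 1.0)
        samples.append((s, a, simulate_step(s, a)))
    return samples


def fit_forward_dynamics(samples):
    # Least-squares fit of next_state = w_s * state + w_a * action,
    # solved with the 2x2 normal equations (Cramer's rule).
    ss = sum(s * s for s, a, nxt in samples)
    sa = sum(s * a for s, a, nxt in samples)
    aa = sum(a * a for s, a, nxt in samples)
    sn = sum(s * nxt for s, a, nxt in samples)
    an = sum(a * nxt for s, a, nxt in samples)
    det = ss * aa - sa * sa
    w_s = (sn * aa - an * sa) / det
    w_a = (an * ss - sn * sa) / det
    return w_s, w_a


def rollout(weights, state, actions):
    # Use the fitted model to imagine a trajectory for a sequence of actions.
    w_s, w_a = weights
    states = []
    for a in actions:
        state = w_s * state + w_a * a
        states.append(state)
    return states
```

With noise-free toy data, `fit_forward_dynamics(collect_seed_samples())` recovers the two coefficients of the made-up physics, and `rollout` then predicts what a string of actions would do without touching the "real" system. Real world models swap the two weights for a large video-pretrained network, which is where the sample efficiency is claimed to come from.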
By utilizing world models, robots learn not by memorizing millions of environments, but through an implicit, internalized understanding of physics.
Pre-trained world models are now sophisticated enough to execute zero-shot tasks out of the box. Robots then try those tasks in the wild and instantly convert the real-world interactions into clean, autonomous training tokens.
Instead of racing to collect the most data, the winning recipe is now sample efficiency.
And beneath that sits the model architecture that turns the fewest training examples into the most action.



