Nvidia's Cosmos 3: one model that can understand, simulate, and act across many physical AI tasks.
It treats action as a first-class language of the world.
Most AI models look at reality from the outside: images become captions, videos become descriptions, and motion becomes something to label after the fact.
Cosmos 3 tries to collapse that distance by putting language, image, video, audio, and action into one shared system, so a robot can connect what it sees with what might happen next and what it should do.
For a home robot, recognizing a plate, a table, and a human instruction is not enough; the useful question is what changes when it moves, grasps, slips, bumps, or waits.
That is why the paper’s action-token design matters: it turns movement into something the model can condition on, infer from video, or generate alongside a future scene.
----
Link – arxiv.org/abs/2606.02800
Title: "Cosmos 3: Omnimodal World Models for Physical AI"




