Nvidia Releases Cosmos 3 Omnimodal World Model for Physical AI

Original post

Rohan Paul @rohanpaul_ai · #1257 in Tech

Nvidia's Cosmos 3: one model that can understand, simulate, and act across many physical AI tasks.

It treats action as a first-class language of the world.

Most AI models look at reality from the outside: images become captions, videos become descriptions, and motion becomes something to label after the fact.

Cosmos 3 tries to collapse that distance by putting language, image, video, audio, and action into one shared system, so a robot can connect what it sees with what might happen next and what it should do.

A home robot cannot simply recognize a plate, a table, and a human instruction, because the useful question is what changes when it moves, grasps, slips, bumps, or waits.

That is why the paper’s action-token design matters: it turns movement into something the model can condition on, infer from video, or generate alongside a future scene.

----

Link – arxiv.org/abs/2606.02800

Title: "Cosmos 3: Omnimodal World Models for Physical AI"

7:06 AM · Jun 13, 2026 · 3.3K Views

Sentiment

Positive users praise Nvidia's Cosmos 3 for integrating language, image, video, audio, and action, calling it a breakthrough in physical AI, while negative users dismiss it as inaccurate and merely a better imitator.

Pos: 50.0%

Neg: 50.0%

4 comments with sentiment.

Posts from X

sora @varmology

@rohanpaul_ai Have you tried it? It’s inaccurate

16h

Shinka - AI @ShinkaIoT

@rohanpaul_ai The shift from describing reality to directly acting within it is the real breakthrough for physical AI.

9h

Pode vir @thiagoTF

@rohanpaul_ai Action as a language. Cute. Still just a better imitator, though. Where's the signal for what humans actually want this thing to do?

17h

That AI Guy @LewisWeldtech

@rohanpaul_ai https://www.academia.edu/168584083/Universal_Dust_Theory_The_Lewis_Conjecture_2?source=swp_share

7h

Abdulrashid | AI & Robotics @AIwithImran

@rohanpaul_ai Cosmos 3: language, image, video, audio, action in one model. Action as a first-class token, not a post-hoc label. Robot sees plate → infers slip probability → adjusts grip. That's the gap between recognition and manipulation.

13h