Ethan He, former xAI world model lead, compares current video AI to early autocomplete and predicts LLMs will control interactive video environments

Views: 43.5K · Bookmarks: 205 · Likes: 201 · Retweets: 13 · Replies: 34

This pod was an incredible gift to the community:

not only our first pod about @xAI, but Ethan really indulged all our questions on how to train a SOTA Videogen world model, including specific areas (consistent extending/editing, voice) where Grok @Imagine is *still* SOTA,

on top of the factual overviews he ALSO came loaded with opinions/predictions:

- why he's quitting Videogen for LLMs: video models get most of their intelligence from LLMs, not from scaling video data
- why the next frontier for videogen also happens to be video agent models - agentic models trained to orchestrate video models
- why deterministic compression (like MP4) is a useless target vs VAE compression
- Videomaxxing: if you truly believe in the "Moore's law" of AI/genmedia, then video models become the final boss UI of everything, like Flipbook (below)

Latent.Space@latentspacepod

🆕Grok Imagine’s Video Agent Moment: Cosmos, xAI, World Models, Generative UI, & the Codex Phase for Video!

https://www.latent.space/p/video-agents

@EthanHe_42, former @xai world model lead and @nvidia Cosmos researcher, explains why AI video may follow the same path as coding agents, how Grok Imagine went from zero to one, why text-to-video is only the autocomplete phase, how world models become real-time and interactive, why language models may become the control layer for video, and why the future of AI video may look less like a prompt box and more like an agent with a camera, editor, timeline, and tool belt.

28d43.5K201205

Latent.Space@latentspacepod

🆕Grok Imagine’s Video Agent Moment: Cosmos, xAI, World Models, Generative UI, & the Codex Phase for Video https://latent.space/p/xai

28d6.4K5025

This Week in Startups@twistartups

How a small xAI team shipped a state-of-the-art video model in 3 months:

🦾 Strong talent with a shared goal
🤝 One sync per day
🏗️ All the rest of the time building

Less coordination, more compute, and more iterations.

28d5.5K2411

Zeeshan Patel@zeeshanp_

Great episode on @latentspacepod with @EthanHe_42 covering how modern video foundation models are trained and some of the work we did to build Grok Imagine in 3 months. Ethan has some very interesting takes, especially around how language is the main driver of progress in visual generative models.

Ethan He@EthanHe_42

In @latentspacepod podcast, I shared my view on video generation, world models, LLMs, agents, continual learning and where the next frontier is.

1. Video models get most of their intelligence from language, not from video data.
2. Idea-to-code is fast now. The bottleneck is back to having enough compute to try every idea.
3. Iteration speed beats almost everything else in model development.
4. The next leap won't be a better video model. It'll be a video agent.
5. Diffusion will be the frontend of AGI, the LLM the backend. Generative UI will replace HTML/CSS: user intent straight to pixels.
6. Physical embodiment may become a tool a powerful AI picks up. Robotics may get solved by video-capable LLMs.
7. Continual learning may look like models that manage their own context, and even rewrite their own harness at test time.

Thanks @swyx and @vibhuuuus for having me 🙏 https://www.youtube.com/watch?v=jPtQlILfkhA

28d4.7K2511

Zain Shah@zan2434

Latest latentspace pod is excellent. @EthanHe_42 really gets it and lays out a lot of the thinking that led us to Flipbook and how to think about the future of generative UI agents 🤝 video gen 🤝 users

28d5.7K156

swyx @aiDotEngineer WF Day 1@swyx

@xai @imagine see more about flipbook from @zan2434 and @eddiejiao_obj !

Eddie Jiao@eddiejiao_obj

What if your whole computer were just pixels streamed to you from a model? I’ve been working with @zan2434 and @drewocarr to imagine a version of generative computing that’s much more flexible and visually rich than the GUIs we have today.

(Video is sped up and edited)

28d5.6K95

swyx@swyx

@xai @imagine full writeup and links here

https://www.latent.space/p/video-agents

28d1.4K2

Zain Shah@zan2434

@swyx @xai @imagine Love the flipbook shoutout. @EthanHe_42 you really get it!

28d46941

This Week in Startups@twistartups

Check out the full @latentspacepod episode with former xAI world model lead @EthanHe_42 https://www.youtube.com/watch?v=jPtQlILfkhA

28d2.1K61

swyx @aiDotEngineer WF Day 1@swyx

@zan2434 @xai @imagine @EthanHe_42 oh hey it's you!! thanks for making flipbook, can't believe this is possible on @modal!!

28d42650

Latent.Space@latentspacepod

@EthanHe_42 @xai @nvidia more from Ethan:

28d6171

𝖊𝖉𝖉𝖎𝖊 𝖏𝖎𝖆𝖔@eddiejiao_obj

@swyx @xai @imagine Love this!

28d4381

haro@harobuilds

@swyx @xai @imagine the "video models get their intelligence from LLMs not from video data" take is the one that actually reframes the whole scaling debate. everyone's been throwing compute at video tokens when the ceiling was upstream the whole time

28d192

Paradigma✳️@stas_paradigma

@swyx @xai @imagine most people sleep on how xai's advantage isn't just compute scale but vertical integration - same team shipping grok chat, imagine, and the training infra. means they can iterate on model failures faster than labs stuck coordinating across org silos

28d391

Paul Sant · Telecodex@YouPulseX

@swyx @xai @imagine Call it SOTA when the extending, editing, and voice tests are public enough for someone outside the pod to break.

28d271

Zain Shah@zan2434

@swyx @xai @imagine @EthanHe_42 @modal Ikr? GPUs are magic haha

28d59

Sanket Datta@sanketdattta

@swyx @xai @imagine Yep.

The useful part here is the specificity. Most AI pods stay at the wow-demo layer, but the real value is hearing where the actual bottlenecks still are: consistency, edits, voice, long-horizon control.

That’s the stuff operators can build around.

28d45

Kyriakos@Kyriakos_Pelek

@latentspacepod @EthanHe_42 @xai @nvidia Agent driven video editing feels like the next big leap

28d41

Jahanzaib Ahmed@jahanzaibai

@swyx @xai @imagine Voice consistency is probably the harder problem than visual coherence for production use. It's what breaks first when you're stitching segments.

28d38

James' AI Takes@JamesTakesOnAI

@zeeshanp_ @latentspacepod @EthanHe_42 language is the driver until the video model needs object permanence for more than 4 seconds. text priors are great scaffolding, but reality still has geometry, physics, and annoying little things like causality.

28d26