LongCat Releases WBench Benchmark for Interactive Video World Models

Original post

Rohan Paul (@rohanpaul_ai)

Most video models look better than they understand, and video quality is only the easiest thing to notice.

LongCat just released WBench, which turns video world model testing from a beauty contest into a stress test for control, multi-turn memory, instruction-following, and physical plausibility.

It exposes the gap between beautiful video generation and controllable world simulation.

A pretty clip is not enough, because a usable world model must keep the same scene, obey later actions, move the camera correctly, preserve objects, and avoid impossible cause-and-effect.

WBench tests this with 289 cases, 1,058 interaction turns, 20 models, 5 dimensions, and 22 automatic metrics, covering navigation, subject actions, event edits, perspective switches, and both first-person and third-person viewpoints.

Across all those 20 evaluated models, the paper finds that no model dominates all dimensions, which means current systems have not yet merged high-quality rendering, reliable control, long-horizon memory, and physical rule-following into one stable capability.
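One way to read "no model dominates all dimensions" is that no single model is the top scorer on every dimension at once. A rough sketch of that check follows; the model names, dimension labels, and scores are made up for illustration and are not WBench results or the benchmark's official dimensions.

```python
def dominant_model(scores):
    """Return the model that is top-scoring on every dimension, or None."""
    dimensions = next(iter(scores.values())).keys()
    # Collect the best model per dimension; one unique winner means dominance.
    winners = {max(scores, key=lambda m: scores[m][d]) for d in dimensions}
    return winners.pop() if len(winners) == 1 else None


# Hypothetical scores, invented purely for illustration.
scores = {
    "model_a": {"quality": 0.9, "control": 0.4, "memory": 0.5, "physics": 0.6},
    "model_b": {"quality": 0.6, "control": 0.8, "memory": 0.7, "physics": 0.5},
    "model_c": {"quality": 0.5, "control": 0.5, "memory": 0.6, "physics": 0.8},
}

print(dominant_model(scores))
```

Here each model wins somewhere and loses elsewhere, so the function returns `None`, which is the pattern the post describes.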

Its design separates the world setup from the user action, so researchers can identify whether a failure comes from weak rendering, poor scene setup, bad control, lost state, or broken physics.

Navigation has near-zero connection with visual quality, consistency, or physics, meaning a model can look strong while still failing to move on command.
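To make "near-zero connection" concrete: one plausible reading is a correlation close to zero between per-model navigation scores and scores on another dimension. The post does not say which statistic the paper uses, so the Pearson correlation below is an assumption, and the numbers are synthetic, not WBench data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Synthetic per-model scores, invented for illustration only.
navigation = [0.2, 0.4, 0.6, 0.8]
visual_quality = [0.7, 0.5, 0.5, 0.7]

r = pearson(navigation, visual_quality)
print(f"correlation: {r:.3f}")
```

In this toy data, the best-looking models sit at both ends of the navigation range, so visual quality tells you nothing about whether a model moves on command.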

The key shift: stop asking only “does the video look good?” and start asking “can the model keep a controllable world alive across many turns?”

🧵 1.

2:16 AM · Jun 2, 2026 · 3.2K Views

Rohan Paul (@rohanpaul_ai)

🧵 2. An interactive world model is closer to a simulator than a video filter, so the real test is whether the world survives use.

WBench makes that difference concrete by separating the initial world from the later user actions.

The model first gets a setting (a scene, style, subject, and viewpoint) and then has to handle a chain of interactions, such as moving forward, making the subject jump, changing an event, or switching from third-person to first-person view.
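The split between a fixed world setup and a chain of later actions can be sketched as a small data structure. This is a hypothetical layout for illustration, not WBench's actual schema; the class names, field names, and example values are all assumptions, with only the setup fields and interaction types taken from the thread.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    NAVIGATION = "navigation"
    SUBJECT_ACTION = "subject action"
    EVENT_EDIT = "event edit"
    PERSPECTIVE_SWITCH = "perspective switch"


@dataclass(frozen=True)
class WorldSetup:
    scene: str
    style: str
    subject: str
    viewpoint: str  # "first-person" or "third-person"


@dataclass(frozen=True)
class Turn:
    kind: Kind
    instruction: str


@dataclass
class Case:
    setup: WorldSetup
    turns: list[Turn] = field(default_factory=list)


def replay(case):
    """Yield (setup, earlier_turns, current_turn) for each turn in order."""
    for i, turn in enumerate(case.turns):
        yield case.setup, case.turns[:i], turn


# Example values are invented; only the action types come from the thread.
case = Case(
    setup=WorldSetup("forest trail", "photorealistic", "hiker", "third-person"),
    turns=[
        Turn(Kind.NAVIGATION, "move forward"),
        Turn(Kind.SUBJECT_ACTION, "make the subject jump"),
        Turn(Kind.EVENT_EDIT, "make it start raining"),
        Turn(Kind.PERSPECTIVE_SWITCH, "switch to first-person view"),
    ],
)

for setup, history, turn in replay(case):
    print(f"turn {len(history) + 1}: {turn.kind.value}: {turn.instruction}")
```

The setup stays constant while the history grows with each turn, which is what lets a failure be traced to the initial world or to a specific later action.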
