
LongCat Releases WBench Benchmark for Interactive Video World Models

Rohan Paul (@rohanpaul_ai)

Most video models look better than they understand, and video quality is only the easiest thing to notice.

LongCat just released WBench, which turns video world model testing from a beauty contest into a stress test for control, multi-turn memory, instruction-following, and physical plausibility.

It exposed the gap between beautiful video generation and controllable world simulation.

A pretty clip is not enough, because a usable world model must keep the same scene, obey later actions, move the camera correctly, preserve objects, and avoid impossible cause-and-effect.

WBench tests this with 289 cases, 1,058 interaction turns, 20 models, 5 dimensions, and 22 automatic metrics, covering navigation, subject actions, event edits, and perspective switches in both first-person and third-person viewpoints.

Across all 20 evaluated models, the paper finds that no model dominates every dimension, which means current systems have not yet merged high-quality rendering, reliable control, long-horizon memory, and physical rule-following into one stable capability.
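A minimal sketch of what "no model dominates" means in practice, assuming each model gets one score per dimension and higher is better. The model names, dimension names, and numbers below are made up for illustration; they are not WBench results, and the post does not describe how the paper aggregates its 22 metrics.

```python
from typing import Dict

# Hypothetical score table: model -> dimension -> score (higher is better).
# These values are invented for illustration, not taken from WBench.
ScoreTable = Dict[str, Dict[str, float]]


def dominates_all(model: str, table: ScoreTable) -> bool:
    """Return True if `model` has the top score (ties allowed) in every dimension."""
    return all(
        score >= max(other[dim] for other in table.values())
        for dim, score in table[model].items()
    )


table: ScoreTable = {
    "model_a": {"quality": 0.9, "control": 0.4, "memory": 0.5},
    "model_b": {"quality": 0.6, "control": 0.8, "memory": 0.5},
    "model_c": {"quality": 0.7, "control": 0.5, "memory": 0.7},
}

# Each toy model wins one dimension and loses another, so nobody dominates.
winners = [name for name in table if dominates_all(name, table)]
print(winners)  # []
```

An empty list here mirrors the post's claim: every model leads somewhere and trails somewhere else.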

Its design separates the world setup from the user action, so researchers can identify whether a failure comes from weak rendering, poor scene setup, bad control, lost state, or broken physics.

Navigation performance has near-zero correlation with visual quality, consistency, or physics, meaning a model can look strong while still failing to move on command.
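To make the "near-zero" claim concrete, here is a rough sketch that reads it as a Pearson correlation between per-model scores on two dimensions. That reading is an assumption; the post does not say which statistic the paper uses. The scores are toy values chosen so that navigation and visual quality are unrelated.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for four models (0-100 scale), not WBench data.
navigation = [20, 80, 20, 80]
quality = [50, 50, 90, 90]

r = pearson(navigation, quality)
print(r)  # 0.0 for this toy data
```

In this toy table the best-looking models are split evenly between good and bad navigators, which is the pattern the post describes: looking strong says nothing about moving on command.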

The key shift: stop asking only “does the video look good?” and start asking “can the model keep a controllable world alive across many turns?”

🧵 1.

2:16 AM · Jun 2, 2026 · 3.2K Views
Rohan Paul (@rohanpaul_ai)

🧵 2. An interactive world model is closer to a simulator than a video filter, so the real test is whether the world survives use.

WBench makes that difference concrete by separating the initial world from the later user actions.

The model first gets a setting, such as a scene, style, subject, and viewpoint, then it has to handle a chain of interactions like moving forward, making the subject jump, changing an event, or switching from third-person to first-person view.
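The setup-then-interactions structure described above could be modeled roughly as follows. This is a sketch under stated assumptions, not WBench's actual schema: the class names, field names, turn kinds, and the example case are all hypothetical, and the rule that a perspective switch toggles between first-person and third-person is a simplification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class WorldSetup:
    """Initial world, fixed before any user action (hypothetical fields)."""
    scene: str
    style: str
    subject: str
    viewpoint: str  # "first-person" or "third-person"


@dataclass(frozen=True)
class Turn:
    """One user interaction applied to the existing world."""
    kind: str  # e.g. "navigation", "subject_action", "event_edit", "perspective_switch"
    instruction: str


@dataclass
class Case:
    setup: WorldSetup
    turns: List[Turn] = field(default_factory=list)


def expected_viewpoints(case: Case) -> List[str]:
    """Viewpoint that should be in effect after each turn.

    Assumes only a "perspective_switch" turn changes the viewpoint,
    and that it toggles between the two options.
    """
    current = case.setup.viewpoint
    result = []
    for turn in case.turns:
        if turn.kind == "perspective_switch":
            current = "first-person" if current == "third-person" else "third-person"
        result.append(current)
    return result


# Hypothetical example case; the content is invented for illustration.
case = Case(
    setup=WorldSetup("forest trail", "photorealistic", "hiker", "third-person"),
    turns=[
        Turn("navigation", "move forward"),
        Turn("subject_action", "make the subject jump"),
        Turn("event_edit", "start a rainstorm"),
        Turn("perspective_switch", "switch to first-person view"),
    ],
)

print(expected_viewpoints(case))
# ['third-person', 'third-person', 'third-person', 'first-person']
```

Keeping the setup immutable and the turns in an ordered list makes it possible to say which turn a failure first appears in, which is the kind of attribution the thread credits to the benchmark's design.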
