
Cambrian series co-creator Saining Xie introduces Cambrian-P, an MLLM grounded in camera pose for spatial reasoning

The architecture uses pose tokens instead of heavy 3D modules.

Original post

Camera pose matters for video understanding! Today's MLLMs excel at recognizing activities, but still struggle with the underlying space and ego/object dynamics in video. We trace this gap to a missing piece: camera pose. Introducing Cambrian-P: a multimodal LLM natively grounded in camera pose. (1/n)

4:14 PM · May 26, 2026
Reposted by

📸latest in our cambrian series: cambrian-p, p for pose. i think pose is probably the minimal sufficient 3d signal (and it’s easy to get!) that we need for robust video multimodal models -- jointly modeling frames and pose turns image sequences into a globally grounded structure.

Jihan Yang @jihanyang13

Camera pose matters for video understanding! Today's MLLMs excel at recognizing activities, but still struggle with the underlying space and ego/object dynamics in video. We trace this gap to a missing piece: camera pose. Introducing Cambrian-P: a multimodal LLM natively grounded in camera pose. (1/n)

11:14 PM · May 26, 2026 · 24.1K Views
2:12 AM · May 27, 2026 · 11.2K Views