1d ago

Chan Hee Song finds vision-language models rely on 2D shortcuts, like vertical position, instead of understanding true 3D spatial relationships

The team introduced SpatialTunnel to test true spatial reasoning.

0
Original post

Do VLMs actually understand 3D space 🌎? Or are they exploiting shortcuts hidden in natural images? 🚀 Excited to share our new work: Why Far Looks Up: Probing Spatial Representation in Vision-Language Models @NVIDIAAI × @SeoulNatlUni × @OhioStateCSE 🧵👇

3:03 PM · May 29, 2026 View on X
Reposted by