
Chan Hee Song finds that vision-language models rely on 2D shortcuts, such as vertical position, instead of understanding true 3D spatial relationships.

The team introduced SpatialTunnel to test true spatial reasoning.



Original post

#383@YSU_NLPOP

Chan Hee (Luke) Song (@LUKE_CH_SONG)

Do VLMs actually understand 3D space 🌎? Or are they exploiting shortcuts hidden in natural images? 🚀 Excited to share our new work: Why Far Looks Up: Probing Spatial Representation in Vision-Language Models @NVIDIAAI × @SeoulNatlUni × @OhioStateCSE 🧵👇

3:03 PM · May 29, 2026

Reposted by

#888@FREDAHSHI

