1d ago

Chan Hee Song and Yu Su find weaker vision-language models rely on visual shortcuts instead of true 3D depth representations

The researchers developed SpatialTunnel to evaluate spatial visual question answering (VQA).

Sentiment: 100% positive, 0% negative

Users appreciate the study probing whether VLMs truly grasp 3D space or exploit image shortcuts, citing its importance for real-world physical deployments and showing interest in trying the benchmarks.

4 comments with sentiment.
