Chan Hee Song and Yu Su find weaker vision-language models rely on visual shortcuts instead of true 3D depth representations · Digg