
(2) Prior work added intermediate visual representations (depth maps, visual thoughts, and ...), but those mostly refine what's already visible. IPT instead predicts the structure that's missing yet implied, externalizing what a VLM would perceive from a different viewpoint.