Picture your living room. If you sat on the sofa, would the TV be on your right or left? You didn't reason in words, you placed yourself in the scene. Imagining in visual space, not text. Exactly what VLMs can't do. Our new paper tackles this with Imaginative Perception Tokens (IPT) 🧵

(2) Prior work added intermediate visual reps (depth maps, visual thoughts and ...), but those mostly refine what's already visible. IPT predicts the structure that's missing yet implied, externalizing what a VLM would perceive under a different viewpoint.

(9) 🖇 Paper: https://arxiv.org/pdf/2606.03988 📊 Code/Data: http://mahtabbigverdi.github.io/Imaginative-tokens.github.io/

(7) And forcing reasoning through text backfires. A verbal chain-of-thought lags plain answer-training on every benchmark; viewpoints, occlusions & cross-view matches don't serialize cleanly into language. Geometry wants a visual workspace, not a sentence.

(3) We study this with three tasks that genuinely require imagination:
🔭 Perspective Taking: simulate a new camera pose
🧭 Path Tracing: mentally walk a path, infer what's visible at the midpoint
🔢 Multiview Counting: fuse partial views into one map, count w/o duplicates
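To make the Multiview Counting task concrete, here is a minimal toy sketch of "fuse partial views into one map, count w/o duplicates". It is not the paper's method: it assumes each view already yields labeled detections in shared world coordinates, and the function name `count_unique_objects` and the `merge_radius` threshold are hypothetical choices for illustration.

```python
from math import hypot


def count_unique_objects(views, merge_radius=0.5):
    """Fuse per-view detections into one map and count each object once.

    views: list of views, each a list of (label, x, y) tuples in a shared
    world frame (an assumption of this sketch).
    """
    fused = []  # unique objects kept so far, as (label, x, y)
    for detections in views:
        for label, x, y in detections:
            # Same label within merge_radius counts as the same object.
            duplicate = any(
                label == f_label and hypot(x - fx, y - fy) <= merge_radius
                for f_label, fx, fy in fused
            )
            if not duplicate:
                fused.append((label, x, y))

    counts = {}
    for label, _, _ in fused:
        counts[label] = counts.get(label, 0) + 1
    return counts


# Two partial views that both see the chair near (3, 2).
views = [
    [("chair", 1.0, 2.0), ("chair", 3.0, 2.0)],
    [("chair", 3.1, 2.1), ("lamp", 0.0, 0.0)],
]
print(count_unique_objects(views))  # {'chair': 2, 'lamp': 1}
```

The hard part for a VLM is the step this sketch takes for granted: getting from raw images to a shared map in the first place.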

(8) Couldn't have done this without my amazing co-authors @LINJIEFUN & @weikaih04, my advisors Linda Shapiro & @RanjayKrishna, and all my collaborators. So grateful to have worked with this team.

(4) ~20K examples per task (AI2-THOR, Habitat, ProcTHOR + real images), each paired with a ground-truth imagination: a novel viewpoint, a side view, or a top-down map. We fine-tune BAGEL-7B, a unified VLM that can also generate images, so the imagination lives inside the model.
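One way to picture a single training example is the sketch below. Every field name here is hypothetical and the released data may be organized differently; it only illustrates the idea of pairing a question and answer with a ground-truth imagination image.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical identifiers for the three tasks and three imagination types.
TASKS = {"perspective_taking", "path_tracing", "multiview_counting"}
IMAGINATION_KINDS = {"novel_viewpoint", "sideview", "top_down_map"}


@dataclass
class ImaginationExample:
    task: str                # one of TASKS
    input_images: List[str]  # paths to the observed view(s)
    question: str
    answer: str
    imagination_kind: str    # one of IMAGINATION_KINDS
    imagination_image: str   # path to the ground-truth imagined image

    def __post_init__(self):
        # Reject records that use an unknown task or imagination type.
        if self.task not in TASKS:
            raise ValueError(f"unknown task: {self.task}")
        if self.imagination_kind not in IMAGINATION_KINDS:
            raise ValueError(f"unknown imagination kind: {self.imagination_kind}")


example = ImaginationExample(
    task="multiview_counting",
    input_images=["view_0.png", "view_1.png"],
    question="How many chairs are in the room?",
    answer="2",
    imagination_kind="top_down_map",
    imagination_image="map.png",
)
```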

(5) Results: the fine-tuned 7B beats GPT-5 on several spatial tasks.
Perspective Taking: 96.8 vs 79.8
Multiview Counting: 67.3 vs 53.5
Out-of-domain (Habitat) PET: 87.0 vs 69.3
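For a quick read of the gaps, the snippet below computes the margins from the three score pairs quoted in this post. It uses only those numbers; the variable names are illustrative.

```python
# (fine-tuned 7B, GPT-5) scores as quoted in the post above.
results = {
    "Perspective Taking": (96.8, 79.8),
    "Multiview Counting": (67.3, 53.5),
    "Out-of-domain (Habitat) PET": (87.0, 69.3),
}

# Margin in points, rounded to one decimal to avoid float noise.
margins = {task: round(ours - gpt5, 1) for task, (ours, gpt5) in results.items()}

for task, margin in margins.items():
    print(f"{task}: +{margin} points")
```

That works out to leads of 17.0, 13.8, and 17.7 points respectively.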

(6) The surprise: IPT models don't even draw the picture at test time. Training on imagination alone reshapes the model's internal spatial representations so it reasons better in plain answer mode, no image generated. The supervision is the point, not the inference-time render.