
New Paper Introduces Imaginative Perception Tokens for VLM Spatial Reasoning

Original post · Ranjay Krishna · #1103

Picture your living room. If you sat on the sofa, would the TV be on your right or left? You didn't reason in words, you placed yourself in the scene. Imagining in visual space, not text. Exactly what VLMs can't do. Our new paper tackles this with Imaginative Perception Tokens (IPT) 🧵

9:39 PM · Jun 8, 2026 · 2K Views
Sentiment

Users praised the collaborative team behind the new IPT method enabling VLMs to reason spatially via visual imagination, expressing explicit gratitude for co-authors and advisors.

Pos 100.0% · Neg 0.0% · 1 comment with sentiment.
Cluster Engagement
Posts from X

(2) Prior work added intermediate visual reps (depth maps, visual thoughts and ...), but those mostly refine what's already visible. IPT predicts the structure that's missing yet implied, externalizing what a VLM would perceive under a different viewpoint.

1d · Views 144 · Likes 1

(9) 🖇 Paper: https://arxiv.org/pdf/2606.03988
📊 Code/Data: http://mahtabbigverdi.github.io/Imaginative-tokens.github.io/

1d · Views 66 · Likes 1

(7) And forcing reasoning through text backfires. A verbal chain-of-thought lags plain answer-training on every benchmark; viewpoints, occlusions & cross-view matches don't serialize cleanly into language. Geometry wants a visual workspace, not a sentence.

1d · Views 50

(3) We study this with three tasks that genuinely require imagination:

🔭 Perspective Taking: simulate a new camera pose
🧭 Path Tracing: mentally walk a path, infer what's visible at the midpoint
🔢 Multiview Counting: fuse partial views into one map, count w/o duplicates
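To make the Multiview Counting task concrete, here is a toy sketch of "fuse partial views into one map, count w/o duplicates". It is not the paper's method: the function name `count_unique`, the `merge_radius` threshold, and the assumption that detections are already given as (x, y) points in a shared world frame are all hypothetical, chosen only for illustration.

```python
from math import dist


def count_unique(views, merge_radius=0.5):
    """Fuse per-view detections into one top-down map and count distinct objects.

    views: list of views, each a list of (x, y) positions. Assumption: all
    positions are already expressed in one shared world frame.
    merge_radius: hypothetical distance below which two detections are
    treated as the same object.
    """
    fused = []  # the shared top-down "map" of distinct objects
    for detections in views:
        for point in detections:
            # Skip a detection that lies close to an object already on the map.
            if not any(dist(point, seen) <= merge_radius for seen in fused):
                fused.append(point)
    return len(fused)


# Two partial views that both see the object near (2, 1).
view_a = [(0.0, 0.0), (2.0, 1.0)]
view_b = [(2.1, 1.0), (4.0, 3.0)]
print(count_unique([view_a, view_b]))  # 3
```

Adding up the two views naively gives 4; fusing them into one map first gives 3, which is the kind of duplicate-free count the task asks for.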

1d · Views 106

(8) Couldn't have done this without my amazing co-authors @LINJIEFUN & @weikaih04, my advisors Linda Shapiro & @RanjayKrishna, and all my collaborators. So grateful to have worked with this team.

1d · Views 78

(4) ~20K examples per task (AI2-THOR, Habitat, ProcTHOR + real images), each paired with a ground-truth imagination: a novel viewpoint, a side view, or a top-down map. We fine-tune BAGEL-7B, a unified VLM that generates images, so the imagination lives inside the model.

1d · Views 54

(5) Results: the fine-tuned 7B beats GPT-5 on several spatial tasks.
Perspective Taking: 96.8 vs 79.8
Multiview Counting: 67.3 vs 53.5
Out-of-domain (Habitat) PET: 87.0 vs 69.3
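For a quick read of the gaps in post (5), the short sketch below computes the margin per task. It assumes the first number in each pair is the fine-tuned 7B and the second is GPT-5, as the post's wording suggests; the names `results` and `margins` are illustrative only.

```python
# Scores copied from post (5); assumed order is (fine-tuned 7B, GPT-5).
results = {
    "Perspective Taking": (96.8, 79.8),
    "Multiview Counting": (67.3, 53.5),
    "Out-of-domain (Habitat) PET": (87.0, 69.3),
}


def margins(scores):
    """Return the per-task gap (first score minus second), rounded to 0.1."""
    return {task: round(a - b, 1) for task, (a, b) in scores.items()}


for task, gap in margins(results).items():
    print(f"{task}: +{gap}")
# Perspective Taking: +17.0
# Multiview Counting: +13.8
# Out-of-domain (Habitat) PET: +17.7
```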

1d · Views 47

(6) The surprise: IPT models don't even draw the picture at test time. Training on imagination alone reshapes the model's internal spatial representations so it reasons better in plain answer mode, no image generated. The supervision is the point, not the inference-time render.

1d · Views 44