New Paper Introduces Imaginative Perception Tokens for VLM Spatial Reasoning

Original post

Mahtab Bigverdi@MahtabBg

Picture your living room. If you sat on the sofa, would the TV be on your right or left? You didn't reason in words, you placed yourself in the scene. Imagining in visual space, not text. Exactly what VLMs can't do. Our new paper tackles this with Imaginative Perception Tokens (IPT) 🧵

9:39 PM · Jun 8, 2026 · 2K Views

Sentiment

Users praised the collaborative team behind the new IPT method enabling VLMs to reason spatially via visual imagination, expressing explicit gratitude for co-authors and advisors.

Pos

100.0%

Neg

0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

Mahtab Bigverdi@MahtabBg

(2) Prior work added intermediate visual reps (depth maps, visual thoughts and ...), but those mostly refine what's already visible. IPT predicts the structure that's missing yet implied, externalizing what a VLM would perceive under a different viewpoint.

Mahtab Bigverdi@MahtabBg

(9) 🖇 Paper: https://arxiv.org/pdf/2606.03988
📊 Code/Data: http://mahtabbigverdi.github.io/Imaginative-tokens.github.io/

Mahtab Bigverdi@MahtabBg

(7) And forcing reasoning through text backfires. A verbal chain-of-thought lags plain answer-training on every benchmark; viewpoints, occlusions & cross-view matches don't serialize cleanly into language. Geometry wants a visual workspace, not a sentence.

Mahtab Bigverdi@MahtabBg

(3) We study this with three tasks that genuinely require imagination:

🔭 Perspective Taking: simulate a new camera pose
🧭 Path Tracing: mentally walk a path, infer what's visible at the midpoint
🔢 Multiview Counting: fuse partial views into one map, count w/o duplicates
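To make the Multiview Counting task concrete, here is a minimal sketch of what "fuse partial views into one map, count w/o duplicates" asks for. It illustrates the task only and is not the paper's method: the function `count_unique`, the `merge_radius` threshold, and the assumption that every view's detections are already expressed as (x, y) points in a shared top-down frame are all hypothetical.

```python
from math import dist


def count_unique(views, merge_radius=0.5):
    """Count distinct objects across views on a shared top-down map.

    Assumption (hypothetical setup): each view is a list of (x, y) points
    already registered to one common map frame. Two detections closer than
    `merge_radius` are treated as the same object.
    """
    fused = []  # one representative point per distinct object
    for detections in views:
        for point in detections:
            # Keep the point only if no already-fused object is nearby.
            if all(dist(point, seen) > merge_radius for seen in fused):
                fused.append(point)
    return len(fused)


# Three partial views that overlap: the same objects show up more than once.
views = [
    [(0.0, 0.0), (2.0, 1.0)],
    [(2.1, 1.0), (4.0, 3.0)],
    [(0.1, -0.1), (4.0, 3.1)],
]

naive = sum(len(v) for v in views)  # adds up per-view counts, double-counting
unique = count_unique(views)        # counts after fusing onto one map
print(naive, unique)
```

Summing per-view counts over-counts whenever views overlap, which is why the task requires building a single fused map before counting.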

Mahtab Bigverdi@MahtabBg

(8) Couldn't have done this without my amazing co-authors @LINJIEFUN & @weikaih04, my advisors Linda Shapiro & @RanjayKrishna, and all my collaborators. So grateful to have worked with this team.

Mahtab Bigverdi@MahtabBg

(4) ~20K examples per task (AI2-THOR, Habitat, ProcTHOR + real images), each paired with a ground-truth imagination: a novel viewpoint, a side view, or a top-down map. We fine-tune BAGEL-7B, a unified VLM that generates images, so the imagination lives inside the model.

Mahtab Bigverdi@MahtabBg

(5) Results: the fine-tuned 7B beats GPT-5 on several spatial tasks.
Perspective Taking: 96.8 vs 79.8
Multiview Counting: 67.3 vs 53.5
Out-of-domain (Habitat) PET: 87.0 vs 69.3
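The size of each gap follows directly from the numbers quoted in the post. The short sketch below only does that arithmetic; the dictionary layout and variable names are illustrative, and the scores are taken as stated in the thread, in whatever unit the paper reports them.

```python
# Scores as quoted in the thread: (fine-tuned 7B with IPT, GPT-5).
results = {
    "Perspective Taking": (96.8, 79.8),
    "Multiview Counting": (67.3, 53.5),
    "Out-of-domain (Habitat) PET": (87.0, 69.3),
}

# Gap in points on the reported scale, rounded to one decimal
# to avoid floating-point noise.
margins = {
    task: round(ipt - gpt5, 1) for task, (ipt, gpt5) in results.items()
}

for task, margin in margins.items():
    print(f"{task}: +{margin}")
```

On these figures the fine-tuned model leads by 17.0, 13.8, and 17.7 points respectively.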

Mahtab Bigverdi@MahtabBg

(6) The surprise: IPT models don't even draw the picture at test time. Training on imagination alone reshapes the model's internal spatial representations so it reasons better in plain answer mode, no image generated. The supervision is the point, not the inference-time render.
