/Tech 1d ago

New Paper Introduces Imaginative Perception Tokens for VLM Spatial Reasoning

Sentiment

Users express gratitude to co-authors and advisors for the new IPT method that lets VLMs reason spatially via visual imagination.

Pos: 100.0%

Neg: 0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

Mahtab Bigverdi@MahtabBg

(2) Prior work added intermediate visual reps (depth maps, visual thoughts and ...), but those mostly refine what's already visible. IPT predicts the structure that's missing yet implied, externalizing what a VLM would perceive under a different viewpoint.

Mahtab Bigverdi@MahtabBg

(9) 🖇Paper: https://arxiv.org/pdf/2606.03988 📊Code/Data: http://mahtabbigverdi.github.io/Imaginative-tokens.github.io/

Mahtab Bigverdi@MahtabBg

(7) And forcing reasoning through text backfires. A verbal chain-of-thought lags plain answer-training on every benchmark; viewpoints, occlusions & cross-view matches don't serialize cleanly into language. Geometry wants a visual workspace, not a sentence.

Mahtab Bigverdi@MahtabBg

(3) We study this with three tasks that genuinely require imagination:

🔭 Perspective Taking: simulate a new camera pose
🧭 Path Tracing: mentally walk a path, infer what's visible at the midpoint
🔢 Multiview Counting: fuse partial views into one map, count w/o duplicates
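To make the "count w/o duplicates" idea concrete, here is a toy sketch in Python. It is not the paper's method (the posts describe a model that imagines the fused map itself). It assumes each view's objects have already been placed in one shared world frame, and the names `count_unique`, `views`, and `cell` are hypothetical.

```python
from collections import Counter

def count_unique(views, cell=0.5):
    """Count objects across views, merging detections that land in the same grid cell.

    views: list of views; each view is a list of (label, x, y) tuples in a
           shared world frame (an assumption made for this illustration).
    cell:  grid size used to decide that two detections are the same object.
    """
    seen = set()
    for detections in views:
        for label, x, y in detections:
            # Snap each detection to a grid cell so repeats collapse to one key.
            seen.add((label, round(x / cell), round(y / cell)))
    return Counter(label for label, _, _ in seen)

# Two partial views that both see the chair near (3, 2).
view_a = [("chair", 1.0, 2.0), ("chair", 3.0, 2.0)]
view_b = [("chair", 3.1, 2.1), ("lamp", 0.0, 0.0)]
counts = count_unique([view_a, view_b])
print(counts["chair"], counts["lamp"])
```

Grid snapping is a crude stand-in for real cross-view matching: two detections of one object that fall on either side of a cell boundary would still be counted twice.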

Mahtab Bigverdi@MahtabBg

(8) Couldn't have done this without my amazing co-authors @LINJIEFUN & @weikaih04, my advisors Linda Shapiro & @RanjayKrishna, and all my collaborators. So grateful to have worked with this team.

Mahtab Bigverdi@MahtabBg

(4) ~20K examples per task (AI2-THOR, Habitat, ProcTHOR + real images), each paired with a ground-truth imagination: a novel viewpoint, a sideview, or a top-down map. We fine-tune BAGEL-7B, a unified VLM that generates images, so the imagination lives inside the model.

Mahtab Bigverdi@MahtabBg

(5) Results: the fine-tuned 7B beats GPT-5 on several spatial tasks.

Perspective Taking: 96.8 vs 79.8
Multiview Counting: 67.3 vs 53.5
Out-of-domain (Habitat) PET: 87.0 vs 69.3
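For a quick read of the gaps, the arithmetic on the quoted scores can be sketched in Python. The numbers are copied from the post; the `scores` table and the `margins` helper are illustrative names, and the post does not state the metric's unit, so the differences are reported as plain points.

```python
# Scores as quoted in the post: (fine-tuned 7B, GPT-5).
scores = {
    "Perspective Taking": (96.8, 79.8),
    "Multiview Counting": (67.3, 53.5),
    "Out-of-domain (Habitat) PET": (87.0, 69.3),
}

def margins(table):
    """Return the fine-tuned model's lead over the baseline for each task."""
    # Round to one decimal to match the precision of the quoted scores.
    return {task: round(ours - baseline, 1) for task, (ours, baseline) in table.items()}

gaps = margins(scores)
for task, gap in gaps.items():
    print(f"{task}: +{gap} points")
```

On these figures the lead is 17.0 points on Perspective Taking, 13.8 on Multiview Counting, and 17.7 on the out-of-domain Habitat setting.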

Mahtab Bigverdi@MahtabBg

(6) The surprise: IPT models don't even draw the picture at test time. Training on imagination alone reshapes the model's internal spatial representations so it reasons better in plain answer mode, no image generated. The supervision is the point, not the inference-time render.
