View Graph Distillation Raises VLM 3D Camera Planning Success to 47.8%


Manling Li@ManlingLi_

Q1. What are the failure modes?

ViewSuite: ~165K task instances, 286 real indoor scenes from ScanNet, 6-DoF camera control, 12 actions.

Manling Li@ManlingLi_

Planning with the views:

Can VLMs predict how each camera move changes the view, and plan many such moves ahead?

We introduce ViewSuite with 6-DoF camera control and ~165K task instances, testing: Path-to-View, View-to-Path, and Interactive View Planning.

A sharp Planning Gap emerges:
+ VLMs can roughly "track" how a camera action changes the view
- but cannot "compose" a plan towards a target view at all

We then try to teach VLMs with Reinforcement Learning.
- RL cannot teach VLMs such planning ability: only a 2.5% success rate with Qwen2.5-VL-7B.
+ With View Graph Distillation (our RL-Graph-SFT framework), 2.5% → 47.8%

Below, we answer these questions:
Q1. What are the failure modes?
Q2. How can we make RL work?
Q3. What has the model learned? Can we open up the model to see before/after? Can such spatial priors transfer to other view-related tasks?

Led by @James_KKW, great to work with @LINJIEFUN @zhengyuan_yang @shiqi_chen17 @wzenus @drfeifei @jiajunwu_cs Leonidas Guibas, Lijuan Wang.

A joint effort with @StanfordAILab @StanfordSVL @MSFTResearch.


Manling Li@ManlingLi_

Website: https://viewsuite.github.io
Paper: https://arxiv.org/pdf/2605.29563
Data: https://huggingface.co/collections/MLL-Lab/viewsuite-datasets
Model: https://huggingface.co/collections/MLL-Lab/viewsuite-models
Code: https://github.com/mll-lab-nu/ViewSuite

Manling Li@ManlingLi_

Q3. What has the model learned? Can such spatial priors transfer to other view-related tasks?

Exploration strategy: Scene coverage grows rapidly in early turns as the agent explores broadly, then plateaus; target intersection ratio accelerates in the middle turns as the agent moves toward the target view. → It is a two-phase pattern (explore then approach).

Training reshapes the model’s attention: when we open up the VLMs, the trained model allocates more attention to image tokens than the base model across most layers and turns.
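One way to quantify "more attention to image tokens" is the share of attention mass that lands on image-token positions. The sketch below is only an illustration under assumptions: the thread does not say how the measurement was done, and the function names, the per-layer input format, and the averaging are hypothetical.

```python
def image_attention_share(attn_row, is_image_token):
    """Fraction of one query's attention mass that falls on image tokens.

    attn_row: attention weights over key positions (assumed non-negative).
    is_image_token: booleans of the same length, True for image-token positions.
    """
    total = sum(attn_row)
    if total == 0:
        return 0.0
    image_mass = sum(w for w, m in zip(attn_row, is_image_token) if m)
    return image_mass / total


def mean_share_across_layers(rows_per_layer, is_image_token):
    """Average the image-token share over layers (hypothetical aggregation)."""
    shares = [image_attention_share(row, is_image_token) for row in rows_per_layer]
    return sum(shares) / len(shares)
```

Comparing this number between a base and a trained model, layer by layer and turn by turn, would give the kind of before/after picture described above.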

Learned spatial priors transfer to other view-related tasks: direct transfer gives slightly lower performance, but given the same training data, our model achieves higher performance.


Manling Li@ManlingLi_

13 frontier VLMs show a clear planning gap:

+ can roughly track local view changes (27.4%-53.2%)
- but collapse when they must compose them into a plan (2.2%-34.8%)

Failure Modes:

1. Single-turn tracking is roughly within reach, but multi-turn planning collapses entirely.
2. More turns do not close the gap.
3. Higher rendering quality does not help.
4. What predicts failure is view distance: rotation distance for the tracking tasks (Path2View and View2Path), and position distance for planning (Interactive View Planning).
5. Models usually solve view planning by moving until they see the target view and then matching it, rather than by inferring its location beforehand.
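The two view distances in point 4 can be illustrated as follows. This is a sketch under assumptions: the thread does not define how the distances are computed, so it uses the common choices of Euclidean distance between camera positions and geodesic angle between camera rotations. The helper `yaw` exists only to build example rotations.

```python
import math


def position_distance(p, q):
    """Euclidean distance between two camera positions (x, y, z)."""
    return math.dist(p, q)


def rotation_distance(R1, R2):
    """Geodesic angle (radians) between two 3x3 rotation matrices.

    Uses trace(R1^T R2) = sum_ij R1[i][j] * R2[i][j] and
    angle = acos((trace - 1) / 2), clamped for numerical safety.
    """
    trace = sum(R1[i][j] * R2[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_theta)


def yaw(theta):
    """Rotation about the z axis by theta radians (example helper)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

For instance, two cameras at the same position whose headings differ by a quarter turn have position distance 0 and rotation distance π/2.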

The planning gap is therefore more of a cognitive gap: even with the global top-down map in hand, frontier VLMs can rarely anchor egocentric views onto the map, mentally simulate how camera actions change those views, or localize a target view before seeing it.

Prospective spatial reasoning is thus a harder, higher-level capability than tracking.

View planning offers a clean testbed for building prospective spatial reasoning in VLMs.

Manling Li@ManlingLi_

Q2. Can Reinforcement Learning (RL) teach VLMs such planning ability? How can we make RL work?

A natural way to learn planning is RL. With a naive policy succeeding only ~2.5% of the time, PPO/GRPO/SFT-boosting reach at most 6.2%.

Our way past this bottleneck comes from a simple observation: every trajectory, successful or not, traces valid view transitions.

Distilling valid view transitions from raw exploration is then the central challenge.

We construct a view graph, an any-view-to-any-view map assembled from the agent’s on-policy self-exploration.

We then distill it into supervised demonstrations for view planning, and iterate this distillation with further self-exploration.

As the policy improves, its exploration grows the view graph outward iteration by iteration, and the resulting supervision remains matched to the region of view space the agent can actually plan over.
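The idea of stitching valid view transitions into a graph and reading demonstrations off it can be sketched as below. This is a toy illustration, not the paper's implementation. It assumes views are already discretized into hashable ids (real 6-DoF poses would need some merging rule), and it assumes demonstrations are shortest action paths found by BFS. All function names are hypothetical.

```python
from collections import defaultdict, deque


def build_view_graph(trajectories):
    """Merge (view, action, next_view) transitions from all trajectories,
    successful or not, into one directed graph: view -> {next_view: action}."""
    graph = defaultdict(dict)
    for trajectory in trajectories:
        for view, action, next_view in trajectory:
            graph[view][next_view] = action
    return graph


def shortest_action_path(graph, start, goal):
    """BFS over the view graph; returns a list of actions, or None if unreachable."""
    if start == goal:
        return []
    parent = {start: None}  # view -> (previous view, action taken)
    queue = deque([start])
    while queue:
        view = queue.popleft()
        for next_view, action in graph.get(view, {}).items():
            if next_view in parent:
                continue
            parent[next_view] = (view, action)
            if next_view == goal:
                actions, node = [], goal
                while parent[node] is not None:
                    node, action_taken = parent[node]
                    actions.append(action_taken)
                return actions[::-1]
            queue.append(next_view)
    return None


def distill_demonstrations(graph):
    """Turn every reachable (start, goal) pair into a supervised demonstration."""
    views = set(graph) | {n for nbrs in graph.values() for n in nbrs}
    demos = []
    for start in sorted(views):
        for goal in sorted(views):
            if start != goal:
                path = shortest_action_path(graph, start, goal)
                if path is not None:
                    demos.append((start, goal, path))
    return demos
```

Given two trajectories that each miss their own target, say A→B→C and C→D, the merged graph still yields a valid plan from A to D, which is the sense in which failed exploration can become supervision. Rerunning exploration with the improved policy and rebuilding the graph would correspond to one more iteration.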

Like on-policy distillation, our framework learns from the agent’s on-policy exploration. Unlike it, there is no stronger teacher to imitate: the teacher is the environment itself, whose structure the agent reveals by moving through it.

With View Graph Distillation, finally 2.5% → 47.8%
