Planning with the views:
Can VLMs predict how each camera move changes the view, and plan many such moves ahead?
We introduce ViewSuite, with 6-DoF camera control and ~165K task instances, testing:
1. Path-to-View
2. View-to-Path
3. Interactive View Planning
A sharp Planning Gap emerges:
+ VLMs can roughly "track" how a camera action changes the view
- VLMs cannot "compose" a plan towards a target view at all
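To make the "track" vs. "compose" distinction concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration only: the 12-action discrete set (unit moves and 90° turns), the pose representation, and the function names are hypothetical, not ViewSuite's actual action space or API. "Tracking" corresponds to rolling a known action sequence forward; "composing" corresponds to searching for a sequence that reaches a target pose.

```python
from collections import deque

import numpy as np

STEP = 1.0          # assumed translation step (arbitrary units)
ANGLE = np.pi / 2   # assumed rotation step (90 degrees keeps the toy search small)


def rot(axis, angle):
    """Rotation matrix about the x (0), y (1), or z (2) axis."""
    c, s = np.cos(angle), np.sin(angle)
    if axis == 0:
        return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    if axis == 1:
        return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])


# Hypothetical 6-DoF action set: +/- move and +/- turn on each camera-local axis.
ACTIONS = {}
for i, name in enumerate("xyz"):
    for sign, tag in ((1, "+"), (-1, "-")):
        d = np.zeros(3)
        d[i] = sign * STEP
        ACTIONS[f"move_{name}{tag}"] = (np.eye(3), d)
        ACTIONS[f"turn_{name}{tag}"] = (rot(i, sign * ANGLE), np.zeros(3))


def apply_action(pose, action):
    """Apply one camera-local action to a pose (R, t)."""
    R, t = pose
    dR, d = ACTIONS[action]
    return (R @ dR, t + R @ d)


def rollout(pose, actions):
    """'Track': follow a given action sequence and return the final pose."""
    for a in actions:
        pose = apply_action(pose, a)
    return pose


def key(pose):
    """Hashable, rounded pose used to compare and deduplicate states."""
    R, t = pose
    return tuple(np.round(R, 3).ravel()) + tuple(np.round(t, 3))


def plan(start, target, max_depth=4):
    """'Compose': breadth-first search for a shortest action sequence to target."""
    goal = key(target)
    queue = deque([(start, [])])
    seen = {key(start)}
    while queue:
        pose, path = queue.popleft()
        if key(pose) == goal:
            return path
        if len(path) == max_depth:
            continue
        for name in ACTIONS:
            nxt = apply_action(pose, name)
            k = key(nxt)
            if k not in seen:
                seen.add(k)
                queue.append((nxt, path + [name]))
    return None


start = (np.eye(3), np.zeros(3))
target = rollout(start, ["turn_y+", "move_z+", "move_x-"])
found = plan(start, target)
```

Here the planner has exact access to poses, so search is easy. A VLM has to do the same composition from images alone, which is where the gap shows up.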
We then try to teach VLMs with Reinforcement Learning.
- RL alone cannot teach VLMs this planning ability: only a 2.5% success rate with Qwen2.5-VL-7B.
+ With View Graph Distillation (our RL-Graph-SFT framework), 2.5% → 47.8%
Below, we answer these questions:
Q1. What are the failure modes?
Q2. How can we make RL work?
Q3. What has the model learned? Can we open up the model to see before/after? Can such spatial priors transfer to other view-related tasks?
Led by @James_KKW. Great to work with @LINJIEFUN @zhengyuan_yang @shiqi_chen17 @wzenus @drfeifei @jiajunwu_cs, Leonidas Guibas, and Lijuan Wang.
A joint effort with @StanfordAILab @StanfordSVL @MSFTResearch.
