FineVLA Makes Robot Policies Steerable With Varied Language Instructions


Robots can pick and place well now, but ask them to push something over, drag it closer, or roll it, and they fail!

The gap isn't task difficulty; it's that instruction following in VLA (vision-language-action) policies is brittle and doesn't generalize to instructions about how to execute a task.

@Erics_Tong's work tackles exactly this!👇

xintong hu@Erics_Tong

Current robot policies overfit to specific language templates, handling 'pick and place' but freezing on 'drag it to me' or 'push it closer to me.' They also lack control over execution: which hand, what approach angle, where to grasp, which path to follow.

🤖 FineVLA makes robots steerable: changing the instruction alters the execution. Same task, different phrasing, distinct actions, all faithfully done.

🏠 Homepage: https://finevla.xlang.ai
📄 Paper: https://huggingface.co/papers/2605.27284
💻 Codebase: https://github.com/xlang-ai/FineVLA

🧵[1/6]


xintong hu@Erics_Tong

🧵[5/6] Key findings:

🔬 i. No Sacrifice: Fine-grained (FG) data doesn't hurt goal-level success. FG-only consistently outperforms Raw-only by +1.4 to +8.1 pts. The OFT-vs-GR00T architecture gap shrinks from 6.4 to just 0.8, which indicates that the gains transfer across architectures.

🔬 ii. Complementary: FG and raw instructions are complementary, not competing. Performance follows a clear inverted-U, peaking at FG:Raw = 1:1. Best mix: 86.8%/82.5% in simulation (+15/+11.1 over Raw-only), 62.7 in real-world (+12.8 over Raw-only).

🔬 iii. Steerable: Fine-grained language gives robots true factor-level controllability. Same task, different instructions → different execution:
- Object pose: 24 → 47 (+23)
- Approach direction: 60 → 78 (+18)
- Target color: 22 → 40 (+18)
- Rotation: 76 → 86 (+10)
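
To make the FG:Raw ratio from finding ii concrete, here is a minimal sketch of building a mixed instruction set at a chosen ratio. The thread does not say how the mix is actually constructed, so the function name `mix_instructions`, its parameters, and the sample-with-replacement strategy are all assumptions for illustration.

```python
import random

def mix_instructions(fg_samples, raw_samples, fg_ratio=0.5, size=None, seed=0):
    """Hypothetical helper: draw a training set in which about fg_ratio of
    the items are fine-grained (FG) and the rest are raw instructions.

    Sampling is with replacement, so either pool may be smaller than its
    share of the mix. Both pools must be non-empty if their share is > 0.
    """
    rng = random.Random(seed)
    if size is None:
        size = len(fg_samples) + len(raw_samples)
    n_fg = round(size * fg_ratio)      # items taken from the FG pool
    n_raw = size - n_fg                # remainder comes from the raw pool
    mixed = rng.choices(fg_samples, k=n_fg) + rng.choices(raw_samples, k=n_raw)
    rng.shuffle(mixed)                 # interleave the two sources
    return mixed

# FG:Raw = 1:1 corresponds to fg_ratio=0.5
batch = mix_instructions(["fg"] * 3, ["raw"] * 5, fg_ratio=0.5, size=10)
```

Sweeping `fg_ratio` from 0 to 1 is one way an inverted-U like the one reported could be probed, though the thread does not describe the actual experimental protocol.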


xintong hu@Erics_Tong

🧵[2/6] FineVLA-Tool scales fine-grained annotation without labeling all the data: it unifies 972K trajectories from 10 robot datasets, uses DTW (dynamic time warping) clustering to pick representative trajectories, then annotates 10 execution dimensions. The result is FineVLA-Data: 47K human-verified trajectories with 10.4× richer instructions.
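
As a rough illustration of the "DTW to pick representatives" idea, the sketch below computes the classic DTW distance between 1-D sequences and picks the medoid of a group as its representative. This is not the FineVLA-Tool implementation: the 1-D trajectories, the absolute-difference cost, and the medoid rule (`pick_medoid`) are assumptions made for the example.

```python
from math import inf

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance between
    two 1-D numeric sequences, using absolute difference as local cost."""
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def pick_medoid(trajectories):
    """Hypothetical representative rule: return the index of the trajectory
    with the smallest total DTW distance to all trajectories in the group."""
    totals = [sum(dtw_distance(t, u) for u in trajectories) for t in trajectories]
    return min(range(len(trajectories)), key=totals.__getitem__)

group = [[0, 1, 2], [0, 1, 2, 2], [5, 6, 7]]
representative = group[pick_medoid(group)]
```

Annotating only such representatives, rather than every trajectory, is what would let annotation cost grow with the number of clusters instead of the number of trajectories.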


xintong hu@Erics_Tong

🧵[3/6] RoboFine-Bench tests VLMs as scalable robot-video annotators via two tracks. VQA: 1,030 questions over 10 fine-grained execution dimensions. Caption: 500 videos labeled with 11,632 atomic facts, against which VLM outputs are matched to quantify fact coverage, hallucination, and anti-hallucination.
🤗 Benchmark: https://huggingface.co/datasets/xlangai/RoboFine-bench
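
One plausible reading of the Caption-track metrics is sketched below: coverage as the share of reference atomic facts that a caption states, and hallucination as the share of stated facts with no reference support. The benchmark's exact definitions and matching procedure are not given in the thread, so these formulas, the exact-string matching, and the name `caption_scores` are assumptions. Anti-hallucination is left out because the thread does not define it.

```python
def caption_scores(reference_facts, predicted_facts):
    """Hypothetical scoring: compare facts extracted from a VLM caption
    against human-labeled atomic facts using exact string matching.

    Returns (coverage, hallucination):
      coverage      = matched reference facts / all reference facts
      hallucination = unsupported predicted facts / all predicted facts
    """
    ref = set(reference_facts)
    pred = set(predicted_facts)
    matched = ref & pred
    coverage = len(matched) / len(ref) if ref else 0.0
    hallucination = len(pred - ref) / len(pred) if pred else 0.0
    return coverage, hallucination

reference = ["uses left hand", "grasps handle", "approaches from top", "pushes forward"]
predicted = ["uses left hand", "grasps handle", "cup is red"]
coverage, hallucination = caption_scores(reference, predicted)
```

A real evaluation would need semantic matching between facts (for example with a judge model) rather than string equality; the set arithmetic stays the same once a matching decision exists.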


xintong hu@Erics_Tong

🧵[4/6] RoboFine-VLM is fine-tuned from Qwen3.5-397B-A17B on FineVLA-Data to serve as a scalable robot-video annotator. On RoboFine-Bench, it reaches VQA 68.2 (+8.0/+8.6 over GPT-5.4/Gemini-3.1-Pro) and Caption-Hard 82.2 (+4.2/+6.3), with 71.6% coverage and only 5.2% hallucination.
🤗 Annotator: https://huggingface.co/xlangai/RoboFine-VLM-397B-A17B


xintong hu@Erics_Tong

🧵[6/6]

FineVLA makes robot policies steerable by specifying how to act, not just what to do. We open-source the full stack:
✅ Fine-grained data annotation pipeline
✅ RoboFine-VLM annotator
✅ RoboFine-Bench benchmark

🏠 Homepage: https://finevla.xlang.ai
📄 Paper: https://huggingface.co/papers/2605.27284
💻 Codebase: https://github.com/xlang-ai/FineVLA
🤗 Annotator: https://huggingface.co/xlangai/RoboFine-VLM-397B-A17B
🤗 Benchmark: https://huggingface.co/datasets/xlangai/RoboFine-bench
