17h ago

Qwen Team releases Qwen-VLA, a vision-language-action model achieving 97.9% success on the LIBERO robotics benchmark

A Diffusion Transformer-based action decoder enables direct physical control.

Original post

Excited to share Qwen-VLA paper, our exploration of generalist Vision-Language-Action models. It extends Qwen’s multimodal backbone from visual understanding and reasoning to continuous action generation and trajectory prediction. Paper: https://arxiv.org/pdf/2605.30280

8:31 PM · May 28, 2026

paper: https://huggingface.co/papers/2605.30280

AK @_akhaliq

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

3:50 PM · May 29, 2026 · 3.3K Views
3:50 PM · May 29, 2026 · 2.5K Views

Finally, this got done. It always felt off when people were treating VLAs as a separate class from multimodal LLMs. Better late than never.

Shuai Bai @shuai_bai_

3:31 AM · May 29, 2026 · 34.6K Views
6:00 PM · May 29, 2026 · 1.3K Views

Qwen isn't giving up its leadership in multimodality. It'll be interesting to watch how VLAs and world-model-based approaches compete; I think we should have an answer by 2027.

Shuai Bai @shuai_bai_

3:31 AM · May 29, 2026 · 34.6K Views
4:28 AM · May 29, 2026 · 3K Views