Qwen Team releases Qwen-VLA, a vision-language-action model achieving 97.9% success on the LIBERO robotics benchmark
A Diffusion Transformer-based action decoder enables direct physical control.
paper: https://huggingface.co/papers/2605.30280
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Finally, this got done. It always felt off when people were treating VLAs as a separate class from multimodal LLMs. Better late than never.
Excited to share the Qwen-VLA paper, our exploration of generalist Vision-Language-Action models. It extends Qwen’s multimodal backbone from visual understanding and reasoning to continuous action generation and trajectory prediction. Paper: https://arxiv.org/pdf/2605.30280
Qwen isn't giving up its leadership in multimodality. It'll be interesting to watch how VLAs and world-model-based approaches compete; I think we should have an answer by 2027.