Two years ago, we built OSWorld 1.0 — the benchmark that became the standard for computer-use agents. Agents now score 83.5% on it. Problem solved?
Not even close.
🚀 Today we introduce OSWorld 2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks.
What's new:
🎯 108 real-world workflows, each ~1.6 hours ⏱️ for a skilled human
⚙️ ~318 tool calls/task vs. ~30 in OSWorld 1.0
🌍 Grounded in authentic artifacts & stateful user profiles
⚡ Captures real phenomena: dynamic environments, streaming interaction, cross-source reasoning, implicit-state inference & more
📊 Best results: Claude Opus 4.8 reaches the highest accuracy at 20.6%, while GPT-5.5 is far more token-efficient but plateaus near 13%. No one is close to solving real computer use.
🏠 Homepage: https://osworld-v2.xlang.ai
📄 Paper: https://github.com/xlang-ai/OSWorld-V2/blob/main/OSWorld2.0.pdf
💻 Code: https://github.com/xlang-ai/OSWorld-V2
🤗 Dataset: https://huggingface.co/datasets/xlangai/osworld_v2_tasks
🧵 [1/8]





