XLANG Lab releases OSWorld 2.0, a computer-use agent benchmark where frontier models achieve just 20.6% accuracy


I still remember the discussion of the US visa application case. I thought it must be mission impossible for AI and just said you guys must be crazy... now it seems I am the dumb one, haha!

anyway, time for frontier models to fight again! bon courage!

XLANG NLP Lab@XLangNLP

Two years ago, we built OSWorld 1.0 — the benchmark that became the standard for computer-use agents. Agents now score 83.5% on it. Problem solved?

Not even close.

🚀Today we introduce OSWorld 2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks.

What's new:
🎯 108 real-world workflows, each ~1.6 hours ⏱️ for a skilled human
⚙️ ~318 tool calls/task vs. ~30 in OSWorld 1.0
🌍 Grounded in authentic artifacts & stateful user profiles
⚡ Captures real phenomena: dynamic environments, streaming interaction, cross-source reasoning, implicit-state inference & more

📊 Best results: Claude Opus 4.8 reaches the highest accuracy at 20.6%, while GPT-5.5 is far more token-efficient but plateaus near 13%. No one is close to solving real computer use.

🏠 Homepage: https://osworld-v2.xlang.ai
📄 Paper: https://github.com/xlang-ai/OSWorld-V2/blob/main/OSWorld2.0.pdf
💻 Code: https://github.com/xlang-ai/OSWorld-V2
🤗 Dataset: https://huggingface.co/datasets/xlangai/osworld_v2_tasks

🧵 [1/8]


Tao Yu@taoyds

From OSWorld 1.0 to 2.0, we went from minutes (~30 steps) to hours (~318), from single apps to real workflows, from high scores (83%) to hard problems (21%).

1+ year, 20+ people, every task rigorously verified. This is what real CUA evaluation takes. 🙏

👉http://osworld-v2.xlang.ai


Yu Su@ysu_nlp

OSWorld 2 is taking CUA evaluation to the next level of complexity and realism. Congrats to the team! Glad to contribute to this important effort.


Jing Yu Koh@kohjingyu

Very exciting to see more work on benchmarking long-horizon CUAs!

IMO there are only two main challenges remaining until CUAs become mainstream and productionizable: (1) effective long-horizon agents and (2) fast/realtime CUAs. I think we are going to get there very soon.


XLANG NLP Lab@XLangNLP

🧵[7/8] Contributions and Acknowledgments

Huge thanks to the amazing team behind OSWorld-V2 🚀 Led by @yuan_mengq43669, @adlsdztony1 & Xinzhuang Xiong, with @xhluca, @saa1605, @itsyuhao, @jiaqideng07, @xywang626, @DunjieLu1219, @BowenWangNLP, @vincentsunnchen and the whole XLANG crew making it happen.

Deepest gratitude to @taoyds for steering the project, and to our advisors @TianbaoX, @fredsala, @zhouyu, @ysu_nlp, @sivareddyg, @xwang_lk, @JustinLin610, @dayiheng_liu & @PengQi for their guidance throughout 🙏

We thank @SnorkelAI, our research & data partner, for their support of this work. We gratefully acknowledge support from the @GoogleResearch gift fund.


XLANG NLP Lab@XLangNLP

🧵[5/8] The main result is not just a leaderboard. It is a cost–completion frontier.

On OSWorld 1.0, frontier agents are already near 83% accuracy. On OSWorld 2.0, the best 500-step setting still completes only 20.6% of tasks.

🏆 Claude Opus 4.8 reaches the best binary completion: 20.6%
⚡ GPT-5.5 is much more token-efficient: ~13.0% with only ~37K output tokens/task
🧠 Claude Opus 4.7 reaches 18.2%, but needs ~150K tokens/task
💸 Opus 4.8 pushes to 20.6%, but with ~224K tokens/task

The trade-off is steep: extra inference mostly buys partial progress, not reliable completion.

Strong agents can often get halfway through a workflow. The hard part is the last mile: preserving state, resolving conflicts, reacting to updates, and producing a correct final state. And as the task horizon grows, binary completion collapses. Long-horizon computer use is still very far from solved.
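To make the cost–completion trade-off above concrete, here is a minimal sketch that computes how many completion points each extra 100K tokens buys when moving along the three quoted configurations. The numbers are the approximate figures from this post. Treating all three token counts as output tokens per task is an assumption (the post says "output tokens" only for GPT-5.5), and the function name `marginal_gains` is hypothetical.

```python
# Figures as quoted in the post; token counts are approximate.
# Assumption: all three counts are output tokens per task.
RESULTS = [
    ("GPT-5.5", 13.0, 37_000),
    ("Claude Opus 4.7", 18.2, 150_000),
    ("Claude Opus 4.8", 20.6, 224_000),
]


def marginal_gains(results):
    """Compare neighbouring configurations, ordered by token cost.

    Returns tuples of
    (from_model, to_model, extra_points, extra_tokens, points_per_100k_tokens).
    """
    ordered = sorted(results, key=lambda r: r[2])
    gains = []
    for (a_name, a_acc, a_tok), (b_name, b_acc, b_tok) in zip(ordered, ordered[1:]):
        extra_points = b_acc - a_acc
        extra_tokens = b_tok - a_tok
        gains.append((
            a_name,
            b_name,
            round(extra_points, 1),
            extra_tokens,
            round(extra_points / extra_tokens * 100_000, 2),
        ))
    return gains


if __name__ == "__main__":
    for src, dst, pts, toks, rate in marginal_gains(RESULTS):
        print(f"{src} -> {dst}: +{pts} points for +{toks} tokens ({rate} points per 100K)")
```

Under these assumptions the return per extra token shrinks at each step, which is one way to read "the trade-off is steep". The comparison spans different models, so it is only a rough illustration, not a controlled scaling curve.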


XLANG NLP Lab@XLangNLP

🧵[2/8] What does "long-horizon and real-world" mean here?

In one reimbursement task example, the agent must:
📄 follow a tutorial PDF
🏢 operate a legacy ExpenseFlow portal
📬 extract amounts from noisy receipt emails
🏦 cross-check evidence across bank transactions and email
📨 notice a new email that changes the task mid-execution
🔎 recover hidden employee info from a prior report
🧾 prepare valid supporting documents
⚠️ detect inconsistent evidence
🙋 ask the user instead of guessing
✅ complete final review and submission

This is the kind of workflow where current agents still break.


XLANG NLP Lab@XLangNLP

🧵[3/8] Existing desktop benchmarks now risk looking close to solved. The tasks are often short, narrow, and self-contained, so high scores can hide whether agents can actually finish the end-to-end work users care about.

OSWorld 2.0 changes the horizon:
⏱ OSWorld 1.0 median human time: ~2 minutes
⏱ OSWorld 2.0 median human time: ~1.6 hours
📈 ≈48× longer by median time
📌 69.6% of tasks take skilled humans more than one hour
🛠 agent trajectories shift from ~30 steps in OSWorld 1.0 to >250 steps per task in OSWorld 2.0

This is workflow endurance, not isolated GUI control.
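The ≈48× figure follows directly from the two medians quoted above. As a quick check, the sketch below reproduces it and, for comparison, the ratio of agent effort using the ~318 tool calls/task from the opening post. Treating OSWorld 1.0 "steps" and OSWorld 2.0 "tool calls" as comparable units is an assumption, and the function name is hypothetical.

```python
# Approximate figures quoted in the thread.
V1_MEDIAN_HUMAN_MIN = 2.0    # OSWorld 1.0 median human time, in minutes
V2_MEDIAN_HUMAN_HOURS = 1.6  # OSWorld 2.0 median human time, in hours
V1_STEPS = 30                # typical agent steps per task, OSWorld 1.0
V2_TOOL_CALLS = 318          # average tool calls per task, OSWorld 2.0


def horizon_ratios():
    """Return (human-time ratio, agent-step ratio), each rounded to one decimal."""
    time_ratio = (V2_MEDIAN_HUMAN_HOURS * 60) / V1_MEDIAN_HUMAN_MIN
    # Assumption: one OSWorld 1.0 "step" is comparable to one 2.0 "tool call".
    step_ratio = V2_TOOL_CALLS / V1_STEPS
    return round(time_ratio, 1), round(step_ratio, 1)


if __name__ == "__main__":
    time_ratio, step_ratio = horizon_ratios()
    print(f"human time: {time_ratio}x longer, agent steps: {step_ratio}x more")
```

On these numbers human time grows faster than the agent step count, so under that assumption each agent action in OSWorld 2.0 stands in for more human work than in 1.0.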


XLANG NLP Lab@XLangNLP

🧵[4/8] OSWorld 2.0 is designed to be broad enough to be real and controlled enough to score.

The 108 tasks span everyday and professional work across research & education, creative production, engineering & computing, personal services, administration & compliance, business & finance, and healthcare.

The benchmark covers:
🧾 authentic or adapted real-world artifacts: PDFs, spreadsheets, emails, receipts, forms, reports, videos, drawings, and portals
🧠 coherent stateful user profiles
🔄 controlled dynamic updates during execution
🙋 simulated user answers with bounded knowledge
✅ fine-grained final-state checks instead of only binary pass/fail
🔍 QA with unit tests, human re-solving, frontier-agent rollouts, reward-hacking audits, and false-negative audits
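One way to picture "fine-grained final-state checks instead of only binary pass/fail" is a checklist evaluated against the final environment state, reporting both a partial-progress fraction and a strict all-checks-pass flag. The sketch below is illustrative only: the thread does not describe the benchmark's actual checker, so the `Check` class, the `score` function, and the reimbursement-style state keys are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]


@dataclass
class Check:
    """One hypothetical final-state check with a relative weight."""
    name: str
    predicate: Callable[[State], bool]
    weight: float = 1.0


def score(state: State, checks: List[Check]) -> Tuple[float, bool]:
    """Return (partial progress in [0, 1], binary completion)."""
    passed = [c for c in checks if c.predicate(state)]
    total_weight = sum(c.weight for c in checks)
    partial = sum(c.weight for c in passed) / total_weight
    complete = len(passed) == len(checks)
    return partial, complete


# Hypothetical checks for a reimbursement-style task.
CHECKS = [
    Check("amount matches receipts", lambda s: s.get("amount") == 42.50),
    Check("receipt attached", lambda s: s.get("receipt_attached") is True),
    Check("report submitted", lambda s: s.get("submitted") is True),
]

if __name__ == "__main__":
    # The agent filled in the form correctly but never submitted it.
    final_state = {"amount": 42.50, "receipt_attached": True, "submitted": False}
    partial, complete = score(final_state, CHECKS)
    print(f"partial progress: {partial:.2f}, completed: {complete}")
```

A run like this one earns partial credit while still counting as a failure on binary completion, which is the gap between partial progress and end-to-end completion that the thread keeps returning to.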


XLANG NLP Lab@XLangNLP

🧵[6/8] The failures are the main lesson.

Agents execute local actions well, but fail to maintain task-level state over long horizons:

🧭 stated constraints drift after many steps
📨 mid-task updates are missed
🕵️ hidden state is lost
⚖️ conflicts are guessed through instead of clarified
✅ final-state verification is skipped
🔧 error detection and repair fail

OSWorld 2.0 targets these bottlenecks directly: cross-source reasoning, implicit-state inference, multi-item tracking, conflict disambiguation, dynamic environments, streaming interaction, tutorial following, multimodal editing, visual-spatial precision, and proactive interaction.

Progress needs agents that can monitor change, preserve state, ask when evidence is insufficient, and catch mistakes before they propagate.


XLANG NLP Lab@XLangNLP

🧵[8/8] OSWorld 2.0 is built to make long-horizon computer-use evaluation realistic, inspectable, and reproducible.

We believe the next leap in computer-use agents won't come from better GUI clicking — it will come from agents that can read instructions, gather evidence, track state across hundreds of steps, resolve conflicts, and verify their own outputs before submitting.

OSWorld 2.0 is designed to measure exactly that gap — and to drive the field toward closing it.

We release everything to support the community:

🏠 Homepage: https://osworld-v2.xlang.ai
📄 Paper: https://github.com/xlang-ai/OSWorld-V2/blob/main/OSWorld2.0.pdf
💻 Code: https://github.com/xlang-ai/OSWorld-V2
🤗 Dataset: https://huggingface.co/datasets/xlangai/osworld_v2_tasks
📦 Trajectories: https://huggingface.co/datasets/xlangai/osworld2.0-trajectory
🧭 Trajectory Viewer: https://osworld-v2-monitor.xlang.ai


Xin Eric Wang@xwang_lk

Time to dive into more practical and longer-horizon computer use tasks. OSWorld 2.0 arrives at the right time. Excited about the release and congrats to the team! Happy to contribute.


Noema@noemaclips

@XLangNLP @huybery Interesting one! @stalkermustang @xeophon @spicey_lemonade


Mengqi Yuan@yuan_mengq43669

After 15+ months of work, I’m thrilled to finally share OSWorld 2.0 🚀

Huge thanks to all the collaborators, advisors, and friends who supported this project from early ideas to release.

Hope OSWorld 2.0 helps push CUA forward. Excited to hear feedback and discuss more! 🙌


Tianbao Xie@TianbaoX

@JustinLin610 Mission continued sir!


Snorkel AI@SnorkelAI

108 real-world, long-horizon computer-use workflows. Average rollout: 318 tool calls.

Top frontier agent (Claude Opus 4.8 with max thinking + batched tool calls): 20.6% end-to-end completion (54.8% partial progress). Partial progress is real. Reliable end-to-end computer use is not.

Proud to be @XLangNLP's research and data partner on OSWorld 2.0. @qi_zhengyang, @vincentsunnchen, and @fredsala contributed from Snorkel.


Peng Qi@qi2peng2

When working on GUI automation at @OrbyAI, one of the key challenges we faced was collecting challenging and meaningfully realistic GUI tasks. Excited to see OSWorld 2.0 finally released and glad to have played a small part in the project!
