WeaveBench Introduces 114-Task Hybrid Benchmark for Computer-Use Agents

Original post

A long-horizon hybrid-interface benchmark for CUA with 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable artifacts. Across frontier model-runtime pairings, the best PassRate reaches only 41.2%, showing the benchmark remains far from saturated.

5:23 AM · Jun 14, 2026 · 340 Views

Suzana Ilić (@suzatweet)

Paper https://arxiv.org/abs/2606.09426
