A long-horizon, hybrid-interface benchmark for computer-use agents (CUA) with 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable artifacts. Across frontier model-runtime pairings, the best PassRate reaches only 41.2%, showing the benchmark remains far from saturated.
5:23 AM · Jun 14, 2026