MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents
Paper: https://arxiv.org/abs/2606.16748
Code, environment, tasks, and agentic harness: https://mypcbench.com/
MyPCBench provides a Linux desktop environment with 17 simulated real-world web applications and a complete desktop software stack.
The benchmark contains 184 tasks inspired by authentic OpenClaw community requests and evaluates agents through a unified computer-use + bash interface.
We benchmark leading closed- and open-weight models and find that the strongest model, Claude Opus 4.6, completes 55.4% of tasks. Failures are concentrated in long-horizon, multi-application workflows, highlighting that personalization and persistent user context remain key challenges for the next generation of AI personal assistants.
Joint work with @JangLawrenceK, @andrewkjang7 and @kohjingyu.