CMU's Russ Salakhutdinov releases MyPCBench to evaluate personal computer-use AI agents in logged-in environments

Original post

MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents

Paper: https://arxiv.org/abs/2606.16748
Code, Environment and Tasks, Agentic harness: https://mypcbench.com/

MyPCBench provides a Linux desktop environment with 17 simulated real-world web applications and a complete desktop software stack.

The benchmark contains 184 tasks inspired by authentic OpenClaw community requests and evaluates agents through a unified computer-use + bash interface.

We benchmark leading closed- and open-weight models and find that the strongest model, Claude Opus 4.6, successfully completes 55.4% of tasks. Failures are concentrated on long-horizon, multi-application workflows, highlighting that personalization and persistent user context remain key challenges for the next generation of AI personal assistants.

Joint work with @JangLawrenceK, @andrewkjang7 and @kohjingyu.

7:55 AM · Jun 16, 2026 · 1.9K Views

Delta Institute @DeltaInstitutes

If you’re excited about computer-use agents, check out Lawrence’s new paper!!

Lawrence Jang @JangLawrenceK

Computer-use evals like OSWorld still don’t really test personal assistant use cases: logged-in accounts, user data, personalized workflows, or realistic desktop/web environments.

so we made MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents, with 184 tasks across 17 popular website clones, seeded with realistic user data and centered on Michael Scott’s hypothetical desktop.

It’s easy to adopt if you already use OSWorld-style runners - I view it as a personalization-focused, more realistic upgrade for CUA evals.

Website: https://mypcbench.com/
Paper: https://arxiv.org/abs/2606.16748
Code: https://github.com/ljang0/MyPCBench
