New CMU research shows almost any software can become a training ground for AI agents.
Imo, that is a big deal because real work in apps is long, messy, and different across software, so AI agents need realistic places to learn and be judged.
The results also carry bad news: once the tasks look like real work, today’s agents still fail a lot.
Most current agent benchmarks use small web or desktop tasks, so they do not show whether agents can handle real workplace software.
Gym-Anything attacks the setup bottleneck by making environment creation itself an agent job.
One agent writes scripts, installs software, loads real data, opens the app, and collects proof that it works.
A second agent audits that proof with screenshots, logs, files, and checklists, then sends fixes back when the setup is weak.
Using this loop, the authors built CUA-World, with 10,000+ tasks across 200 applications covering all 22 major occupation groups.
Even strong models solved only a small share of the hardest long tasks, which shows that real computer-use work is still far from solved.
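To make the setup loop concrete (one agent builds the environment and collects proof, a second audits that proof and sends fixes back), here is a minimal Python sketch. It is not the paper's code: `Evidence`, `AuditReport`, `build_environment`, `toy_creator`, `toy_auditor`, and the `max_rounds` cap are all hypothetical names and assumptions chosen for illustration, and the two "agents" are plain stub functions rather than models.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Evidence:
    """Proof collected by the creator (hypothetical structure)."""
    screenshots: List[str] = field(default_factory=list)
    logs: List[str] = field(default_factory=list)
    files: List[str] = field(default_factory=list)


@dataclass
class AuditReport:
    """Auditor verdict plus the fixes it wants (hypothetical structure)."""
    passed: bool
    fixes: List[str]


def build_environment(
    create: Callable[[List[str]], Evidence],
    audit: Callable[[Evidence], AuditReport],
    max_rounds: int = 3,  # assumed cap; the post does not state one
) -> Tuple[int, Evidence]:
    """Alternate creator and auditor until the audit passes."""
    fixes: List[str] = []
    for round_no in range(1, max_rounds + 1):
        evidence = create(fixes)   # creator sets up the app, gathers proof
        report = audit(evidence)   # auditor checks proof against a checklist
        if report.passed:
            return round_no, evidence
        fixes = report.fixes       # weak setup: send fixes back to the creator
    raise RuntimeError(f"setup still failing after {max_rounds} rounds")


def toy_creator(fixes: List[str]) -> Evidence:
    """Stub creator: forgets the screenshot until the auditor asks for it."""
    evidence = Evidence(logs=["install ok"], files=["data.csv"])
    if "add screenshot" in fixes:
        evidence.screenshots.append("app_open.png")
    return evidence


def toy_auditor(evidence: Evidence) -> AuditReport:
    """Stub auditor: a two-item checklist over the collected proof."""
    fixes: List[str] = []
    if not evidence.screenshots:
        fixes.append("add screenshot")
    if not evidence.logs:
        fixes.append("add install log")
    return AuditReport(passed=not fixes, fixes=fixes)


if __name__ == "__main__":
    rounds, proof = build_environment(toy_creator, toy_auditor)
    print(rounds, proof.screenshots)
```

In this toy run the first audit fails because no screenshot was collected, the fix is sent back, and the second round passes. In a real system the stubs would be replaced by agents that actually install software and inspect screenshots, logs, and files.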
----
– arxiv.org/abs/2604.06126
Title: "Gym-Anything: Turn any Software into an Agent Environment"