>Why has progress on computer use been so slow?

I think progress has actually picked up a ton this year. The frontier models (anything past Opus 4.6 / GPT-5.4) are usable now. In 2025 none of the models were really reasonable beyond very simple tasks.
>Computer use is so clearly verifiable.

Most computer use tasks have parts that are verifiable and parts that are not (which is why rubrics can be helpful in grading CUA tasks). This also makes it a bit more complex to do RL on than math/coding. Computer use is also harder to write unit tests for (since you don't always have privileged/API access), so many people use LLM-as-a-judge grading, which is finicky in its own ways.
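One way to picture the mix of verifiable and non-verifiable parts is a weighted rubric, where some items have a programmatic check and the rest fall back to a judge. This is only a minimal sketch: `RubricItem`, `grade`, the `final_state` dictionary and the stub judge are all hypothetical names made up for illustration, and a real judge would be an LLM call with its own noise.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RubricItem:
    """One line of a grading rubric (hypothetical structure)."""
    name: str
    weight: float
    # Programmatic check for the verifiable items; None means "ask the judge".
    check: Optional[Callable[[dict], bool]] = None


def grade(final_state: dict,
          rubric: list[RubricItem],
          judge: Callable[[str, dict], float]) -> float:
    """Return a weighted score in [0, 1] for one finished task."""
    total_weight = sum(item.weight for item in rubric)
    score = 0.0
    for item in rubric:
        if item.check is not None:
            # Verifiable part: exact pass/fail.
            earned = 1.0 if item.check(final_state) else 0.0
        else:
            # Non-verifiable part: judge score, clamped to [0, 1].
            earned = min(1.0, max(0.0, judge(item.name, final_state)))
        score += item.weight * earned
    return score / total_weight


if __name__ == "__main__":
    rubric = [
        RubricItem("file was saved", 2.0, check=lambda s: s.get("saved", False)),
        RubricItem("summary reads well", 1.0),  # no check available
    ]
    # Stub judge standing in for an LLM-as-a-judge call (assumption).
    stub_judge = lambda name, state: 0.5
    print(round(grade({"saved": True}, rubric, stub_judge), 3))  # 0.833
```

The finicky part lives entirely in `judge`: the more rubric weight sits on items without a `check`, the noisier the reward.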
Here's a question I find confusing and interesting, and which I think tells us a lot about the nature of current AI progress:
Why has progress on computer use been so slow? Computer use is so clearly verifiable.
I think the answer is that it is not enough for a domain to be verifiable.
It also has to be very grindable—in the sense that you can run lots of parallel rollouts against a deterministic and replayable simulator.
If you’re trying to make a model better at coding, you can create an environment that has a software repo with some missing feature that you’ve tasked the AIs with creating, and then you have a thousand parallel agents just go at the problem, each with its own identical copy of the container.
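To make "grindable" concrete, here is a toy sketch of many rollouts running in parallel against identical copies of a deterministic environment. Everything in it is an assumption for illustration: `ToyRepoEnv` stands in for the container, `verify` for the test suite, and a seeded random policy for the agent.

```python
import copy
import random
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class ToyRepoEnv:
    """Hypothetical stand-in for a container holding a repo with a missing feature."""
    target: int = 7   # the "feature" counts as done once state == target
    state: int = 0

    def step(self, action: int) -> None:
        # Deterministic transition: the same actions always give the same state.
        self.state += action

    def verify(self) -> bool:
        # Plays the role of the unit tests: a cheap, exact check.
        return self.state == self.target


def rollout(base_env: ToyRepoEnv, seed: int, max_steps: int = 10):
    """Run one agent on its own copy; returns (seed, success, actions)."""
    env = copy.deepcopy(base_env)   # each agent gets an identical private copy
    rng = random.Random(seed)       # the seed stands in for policy sampling
    actions = []
    for _ in range(max_steps):
        action = rng.choice([-1, 1, 2])
        env.step(action)
        actions.append(action)
        if env.verify():
            break
    return seed, env.verify(), actions


def run_batch(base_env: ToyRepoEnv, n_rollouts: int):
    """Fan out n_rollouts independent attempts at the same task."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda s: rollout(base_env, s), range(n_rollouts)))
```

Calling `run_batch(ToyRepoEnv(), 1000)` gives a thousand attempts at the same task, and any one of them can be replayed exactly from its seed. Both properties depend on the environment being something you can copy and reset at will.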
But this doesn’t work with computer use, at least not trivially. You can’t have a thousand agents try the same checkout flow on Amazon, because Andy Jassy will detect your bots and shut your ass down.
How would we train an AI to build a business? How would you make an AI that’s really good at winning court cases? Or at having a profitable day trading in the markets? Or at helping a candidate win an election?
What is the RL environment to make an AI as good at politics as Lyndon Johnson, or as good at building a space launch business as Elon Musk?
The rollout requires interacting with the world and cannot simply be recreated inside the datacenter. And the outer-loop verification signal may take months or years of real-world actions to elicit, and you cannot re-observe it by perturbing the model’s actions thousands of times in parallel to isolate what exactly the model did that worked.