Fact 1: Agents at companies did real engineering work autonomously, especially on "hill-climbable" tasks where progress is cheap to verify (reimplementation, vulnerability discovery, optimization). On these tasks, agents completed software projects that would take human experts weeks.
Fact 2: However, agents appeared to be significantly weaker on tasks where success is costly or hard to verify.



