At last week's developer conference, Google claimed that their newest frontier model produced an operating system from just a single prompt and ~$900 in API cost. At first sight, this seems impressive. But on closer look, the evidence is much thinner than the headline suggests.
Most notably:
- The "single prompt" framing suggests that the agent could do this from just a few sentences with high-level instructions. But the prompt itself is many thousands of lines long and it is unclear what instructions Google provided in the prompt (and how much effort it took to even come up with the prompt in the first place).
- There is a lot of OS code on the internet and it is often attempted as a class project in college OS classes. Based on the information provided in Google's blog post, it is unclear to what extent the agent simply copied a well-known implementation from the internet.
- In such long-running, complex implementation tasks, it is important to understand what degree of human intervention was performed to help the agent achieve its goals. However, Google remains ambiguous about the level of hand-holding they performed in this experiment. They say that "no additional guidance or corrections from a human" were necessary, yet they document instances of imposing anti-cheating mechanisms between runs.
- Many key artifacts, such as the code, the prompt, and agent logs are unreleased. This makes it impossible for external researchers to verify these marketing claims.
- To Google's credit, they did release the overall cost and token budget. These details often remain undisclosed, and sharing them is a first step in the right direction.
More detail in our writeup at https://www.normaltech.ai/p/did-googles-ai-agents-really-build
w/ @sayashk @RishiBommasani Andrew Schwartz @random_walker