tl;dr It’s the Age of Research
Something is off about models: 1) Training is so sample inefficient 2) Very long thinking trajectories: models will do the right thing only after exhausting all other possibilities 3) Generalization is whack. Waymo can’t handle construction on highway. No teenager would have this problem.
Unclear why December was an inflection point for coding agents. No single clear attributable change. Google is still pre-December on coding capabilities.
5-10x speed up in work. 3 weeks to implement paper in ye olden days. 2 days with codex + can do many things in parallel.
Humans see, hear, talk, everything all at once. Of course that’s how it should be for models.
Big update on how quickly we got to an intern level coding agent. Didn’t expect that in 2025.

