i've been running Codex for ~8-24h per open math/physics research problem. few thoughts:
parallel agents don't seem to scale that cleanly for a lot of problems. many of these are just extremely sequential. you don't really get to "spawn 50 agents and solve it from nowhere." it's more like: tiny move, check, reframe, tiny move, dead end, try again. hours/days of serial cognition, which honestly rhymes with how these fields move over decades.
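one rough way to put numbers on the serial-vs-parallel point is Amdahl's law. this is a sketch, not a measurement: the function name `amdahl_speedup` and the 0.9 serial fraction are assumptions picked for illustration, and real research work won't split this neatly.

```python
def amdahl_speedup(serial_fraction: float, n_agents: int) -> float:
    """Amdahl's law: best-case speedup when only the non-serial part parallelizes."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if n_agents < 1:
        raise ValueError("n_agents must be >= 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_agents)


if __name__ == "__main__":
    # ASSUMPTION: 90% of the work is strictly sequential (hypothetical number).
    for n in (1, 5, 50):
        print(f"{n:>2} agents -> {amdahl_speedup(0.9, n):.3f}x")
```

under that assumed 90% serial fraction, even 50 agents stay below a 1.11x speedup, which is the "spawn 50 agents" intuition failing in miniature.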
this updates me a bit against the sci-fi picture of "superhuman math/physics intelligence" as some alien oracle that instantly sees the proof / theory.
the actual superhuman-ness is more mundane and maybe more important: the agent has absorbed a huge prior, can read long papers basically instantly, can think/write at >50 tok/s, and you can clone it across dozens of problems. speed + knowledge volume + multiplicability. that's the superpower.
also: frontier physics seems much more tractable for these agents than decade-old open math problems. for some physics directions, ~8h is enough to get something paper-shaped and nontrivial.
big caveat tho: research taste is still missing. the agent is a pretty good problem-solver, but not yet a top-tier problem-picker. it can push hard once the direction is chosen, but you probably still want a human with taste choosing the problem / framing / bet.
my current mental model: agents are becoming very strong research labor, but the bottleneck shifts upward into taste, problem selection, and knowing which hill is worth climbing.