
Aran Komatsuzaki, GPT-J co-leader, says running Codex on open math and physics research problems suggests parallel agents do not scale cleanly

Lewis Tunstall suggested using specialized subagents with limited contexts

Original post

i've been running Codex for ~8-24h per open math/physics research problem. few thoughts:

parallel agents don't seem to scale that cleanly for a lot of problems. many of these are just extremely sequential. you don't really get to "spawn 50 agents and solve it from nowhere." it's more like: tiny move, check, reframe, tiny move, dead end, try again. hours/days of serial cognition, which honestly rhymes with how these fields move over decades.

this updates me a bit against the sci-fi picture of "superhuman math/physics intelligence" as some alien oracle that instantly sees the proof / theory. the actual superhuman-ness is more mundane and maybe more important: the agent has absorbed a huge prior, can read long papers basically instantly, can think/write at >50 tok/s, and you can clone it across dozens of problems. speed + knowledge volume + multiplicability. that's the superpower.

also: frontier physics seems much more tractable for these agents than decade-old open math problems. for some physics directions, ~8h is enough to get something paper-shaped and nontrivial.

big caveat tho: research taste is still missing. the agent is a pretty good problem-solver, but not yet a top-tier problem-picker. it can push hard once the direction is chosen, but you probably still want a human with taste choosing the problem / framing / bet.

current model: agents are becoming very strong research labor, but the bottleneck shifts upward into taste, problem selection, and knowing which hill is worth climbing.

9:20 AM · May 26, 2026

@arankomatsuzaki Not sure if you are already doing this, but you can get much better performance on physics problems by spawning subagents with specific roles and limited context

Leandro von Werra @lvwerra

We released physics-intern: a simple harness for science problems! It gets models like Gemini 3.1 Pro to go from 17.7 -> 31.4, thus beating GPT 5.5 Pro.

The physics-intern harness can wrap any model and via dedicated subagent boost the performance of the vanilla reasoning models. While I think more and more of these harness capability gains will be absorbed into the models (like prompting tricks disappeared over time) there is a lot to be gained right now by building good scaffolds for those models and integrating tools well.

Interestingly, the exception we found that GPT 5.5 Pro actually didn't benefit from the physics-intern harness!

Read more about it here: https://huggingface.co/spaces/huggingface/physics-intern

PS: I think the Harness[Model] notation is kind of nice.

3:01 PM · May 21, 2026 · 77.3K Views
4:54 PM · May 26, 2026 · 301 Views