Tech · 34d ago

Aran Komatsuzaki, GPT-J co-leader, says running Codex on complex math problems shows parallel agents fail to scale

Lewis Tunstall suggested using specialized subagents with limited contexts

Original post

Aran Komatsuzaki@arankomatsuzaki

i've been running Codex for ~8-24h per open math/physics research problem. few thoughts:

parallel agents don't seem to scale that cleanly for a lot of problems. many of these are just extremely sequential. you don't really get to "spawn 50 agents and solve it from nowhere." it's more like: tiny move, check, reframe, tiny move, dead end, try again. hours/days of serial cognition, which honestly rhymes with how these fields move over decades.

this updates me a bit against the sci-fi picture of "superhuman math/physics intelligence" as some alien oracle that instantly sees the proof / theory.

the actual superhuman-ness is more mundane and maybe more important: the agent has absorbed a huge prior, can read long papers basically instantly, can think/write at >50 tok/s, and you can clone it across dozens of problems. speed + knowledge volume + multiplicability. that's the superpower.

also: frontier physics seems much more tractable for these agents than decade-old open math problems. for some physics directions, ~8h is enough to get something paper-shaped and nontrivial.

big caveat tho: research taste is still missing. the agent is a pretty good problem-solver, but not yet a top-tier problem-picker. it can push hard once the direction is chosen, but you probably still want a human with taste choosing the problem / framing / bet.

current model: agents are becoming very strong research labor, but the bottleneck shifts upward into taste, problem selection, and knowing which hill is worth climbing.

9:20 AM · May 26, 2026 · 27.3K Views

Sentiment

Many users praise AI research agents for tackling sequential math and physics problems due to broad knowledge and promising results, while a few object that multiple agents mainly create noise instead of progress.

Positive: 88.4%

Negative: 11.6%

14 comments with sentiment.

Related links

physics-intern (huggingface.co)

Posts from X

Most Activity

Views: 6.2K · Bookmarks: 5 · Likes: 17 · Retweets: 1 · Replies: 4

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

I really have no clue what this means given enormous differences in intelligence and work performance between *humans* (presumably we've only got humans… and AIs now). Known upper limit of "human" is very far from any employee reasonable money can buy.

Candide III@CandideIII

Vindication of @anomalyuk's old post on limitations of intelligence. One point of evidence that the best human intelligence is about the best that intelligence of our type can get is that it does not scale well. Turns out SOTA AI research agents don't scale well either.

34d ago

Lewis Tunstall@_lewtun

@arankomatsuzaki Not sure if you are already doing this, but you can get much better performance on physics problems by spawning subagents with specific roles and limited context

Leandro von Werra@lvwerra

We released physics-intern: a simple harness for science problems!

It gets models like Gemini 3.1 Pro to go from 17.7 -> 31.4, thus beating GPT 5.5 Pro.

The physics-intern harness can wrap any model and, via dedicated subagents, boost the performance of vanilla reasoning models.

While I think more and more of these harness capability gains will be absorbed into the models (as prompting tricks disappeared over time), there is a lot to be gained right now by building good scaffolds for those models and integrating tools well.

Interestingly, the one exception we found is that GPT 5.5 Pro actually didn't benefit from the physics-intern harness!

Read more about it here: https://huggingface.co/spaces/huggingface/physics-intern

PS: I think the Harness[Model] notation is kind of nice.

34d ago

Aran Komatsuzaki@arankomatsuzaki

@_lewtun Thanks! I should def give it a shot. Multi-agent definitely helps, but I just think it saturates rather quickly per problem. But since there are so many problems, I bet cross-problem parallelization is a much bigger source of scaling.

34d ago
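One way to picture the claim in this reply (per-problem parallelism saturates, cross-problem parallelism keeps scaling) is Amdahl's law. The sketch below is only an illustration, not anything from the thread: the function names are hypothetical and the 20% parallelizable fraction is an assumed number, not a measurement of any agent run.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup on ONE problem when only part of the work parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def cross_problem_throughput(agents: int) -> float:
    """Relative throughput when each agent takes its own independent problem.

    Assumes the problems share no work, so throughput grows linearly.
    """
    if agents < 1:
        raise ValueError("agents must be >= 1")
    return float(agents)


if __name__ == "__main__":
    PARALLEL_FRACTION = 0.2  # assumed for illustration only
    for n in (1, 2, 10, 50):
        per_problem = amdahl_speedup(PARALLEL_FRACTION, n)
        across = cross_problem_throughput(n)
        print(f"{n:>2} agents: {per_problem:.2f}x on one problem, "
              f"{across:.0f}x across independent problems")
```

Under that assumed fraction, the per-problem speedup can never exceed 1 / 0.8 = 1.25x no matter how many agents are added, while the cross-problem figure grows with the agent count.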

Aryaman Arora@aryaman2020

@arankomatsuzaki @pangramlabs ?

33d ago

Alex Strudwick Young@AlexTISYoung

@arankomatsuzaki The biggest advantage these models seem to have over humans is the breadth of their knowledge. They can find an idea from a field you don't know much about that solves your problem.

34d ago

Tom Nicholson@TFWNicholson

@arankomatsuzaki Re parallel agents: tree of thoughts type branching etc? At every decision point, try them all and see what the outcome of that step is, rinse, repeat, and backtrack

34d ago
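The branching idea in this reply (try every option at a decision point, look at the outcome, repeat, and backtrack) can be sketched as a plain depth-first search. This is a toy under stated assumptions, not the Tree of Thoughts method itself and not anything an agent harness is known to run: `branch_and_backtrack`, `reach_target`, and the number puzzle are all hypothetical stand-ins, with `expand` playing the role of an agent proposing next steps.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

S = TypeVar("S")


def branch_and_backtrack(
    state: S,
    expand: Callable[[S], Iterable[S]],
    is_goal: Callable[[S], bool],
    max_depth: int,
) -> Optional[List[S]]:
    """Depth-first search: try each candidate move in turn and back up on failure.

    Returns the path from `state` to a goal, or None if no goal is reachable
    within `max_depth` further moves.
    """
    if is_goal(state):
        return [state]
    if max_depth == 0:
        return None  # budget exhausted on this branch: backtrack
    for child in expand(state):
        path = branch_and_backtrack(child, expand, is_goal, max_depth - 1)
        if path is not None:
            return [state] + path
    return None  # every option failed: backtrack further up


def reach_target(target: int, max_depth: int) -> Optional[List[int]]:
    """Toy problem: get from 1 to `target` using the moves x*2 and x+1."""

    def expand(x: int) -> List[int]:
        # Prune moves that overshoot the target.
        return [c for c in (x * 2, x + 1) if c <= target]

    return branch_and_backtrack(1, expand, lambda x: x == target, max_depth)


if __name__ == "__main__":
    print(reach_target(10, max_depth=4))  # [1, 2, 4, 5, 10]
```

In this run the search first follows 1, 2, 4, 8, hits a dead end at 9 with no moves left, and backtracks to 4 before finding the route through 5. Each step still depends on the outcome of the one before it, so branching widens the search without removing the serial chain along any single path.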

Justin Halford@Justin_Halford_

@arankomatsuzaki We need a way to cache best efforts. I’ve had a lossless image compression project spinning for a couple days and have seen promising results in tree search. There has been some lateral transfer between branches. The unlock is, with tight feedback loops, the iteration cadence.

34d ago

Lewis Tunstall@_lewtun

Yeah the main lesson from @dlouapre's work is that context engineering is crucial when you're dealing with open-ended problems. He's also working on porting the scaffold to a set of skills / tools that can plug into CC / Codex directly - we'll let you know when it's ready if you're keen to test it!

33d ago

Sean Cantrell@ThePremiseOfIt

@arankomatsuzaki I've been hammering on research in a physics-adjacent space for a while with agents, and I can tell you most of the work is definitely sequential, but not in a "small iterative step" sense. I'm slightly skeptical of the 8hrs -> paper shaped and nontrivial claim. Would love to see

34d ago

Tom Nicholson@TFWNicholson

@arankomatsuzaki It also goes against the "agents just keep relentlessly trying" aspect that seems to help so much.

34d ago

Yong Zheng-Xin@yong_zhengxin

i personally don’t think research taste matters as much when we have agents churning through all unresolved scientific questions, assuming we have near-infinite inference compute budget.

in other words, all scientific questions become intrinsically valuable and we no longer have to assign arbitrary value due to taste.

34d ago

Aran Komatsuzaki@arankomatsuzaki

@bigswingingdong Yeah it scales to a longer time horizon than GPT 5.5 Pro, which is great. Thank god ChatGPT Pro gives out a generous quota to do this.

34d ago

Yash@yash1_

@arankomatsuzaki tldr: "The human bottleneck in science is shifting from execution to curation." true

34d ago

Dante@Dante_romas

@teortaxesTex PS: The universe is one pond; no new elements exist out there. Musk thinks travel brings answers, but the same periodic table rules deep in space. If it's not on Earth, it's not out there, just different recipes with the same mixed ingredients.

33d ago

Pangram Labs@pangramlabs

@aryaman2020 @arankomatsuzaki We believe that this document is fully AI-generated

https://www.pangram.com/history/e1340c14-637a-41bc-8f5c-23dbda1d2e35

33d ago

Tyler Moore@TylerMooreUS

@arankomatsuzaki Interesting to see how people imagined AI would be. Read another tweet today where the poster was saddened that LLMs were not superior to humans; he seems to have expected angels.

34d ago

Ismael Tagle@ismael_tagle

@arankomatsuzaki I think once we discover a way to integrate the power of world model prediction with LLMs in a single architecture, AI models will have the physical intuition required to make true breakthroughs.

34d ago

Rooke Poole@rookepoole

@arankomatsuzaki The research taste point is the one. I've been working on a 3D cellular automaton framework for years as an independent researcher. With AI help i got a paper to submission quality. The AI didn't pick the problem. It helped me prove what i already knew was worth finding.

34d ago

Candide III@CandideIII

@teortaxesTex The very next paragraph in that post mentions these enormous differences. Give it a read https://www.anomalyblog.co.uk/2012/01/speculations-regarding-limitations-of/

33d ago