Yoav Goldberg, AI2-Israel Research Director, argues ReAct loops using frontier models can cost $15 and take 40 minutes per task

Story Overview

Yoav Goldberg observes that frontier models with tools now work quite well in ReAct-style agent loops, but by his rough estimate a single task can run to about fifteen dollars, forty minutes, and "a trillion tokens" (an informal figure from the post, not a measurement). He asks whether that trade-off is acceptable and argues that smaller models, less reasoning, and better harnesses would be a better solution.

Original post

(((ل()(ل() 'yoav))))👾@yoavgo

we are now at a situation where "frontier models" with tools can work quite well within a ReAct loop (or similar), but take like 15$, 40 minutes and a trillion tokens to perform a single task.

are we ok with this balance?

my aesthetic is that we can and should be much more efficient with smaller models, less reasoning, and better harnesses, and that this will be a better solution.

but am I alone in this?

6:28 AM · Jun 21, 2026 · 4.1K Views

Cost Pressure

Smaller models could trim the bill

Graham Neubig points to smaller options such as Minimax-M2.7 (240B), Qwen-3.6-35B, and GLM 5.2, arguing that plenty of choices exist along the cost-efficiency spectrum. He supplies no head-to-head numbers for the cited tasks, and Goldberg replies that his concern is less model size than heavy reasoning that a smarter control harness might avoid.

Open Question

Task specifics stay out of reach

Beyond Goldberg's remark that the cases involve search-related rather than coding tasks, no details surface on the exact tasks, models, or loop lengths behind the fifteen-dollar figure. That leaves open whether such extremes appear often or only in edge cases, and whether fewer reasoning steps would preserve output quality.
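To see why loop length matters so much for the bill, here is a minimal sketch of a cost estimate for a naive ReAct loop. It assumes no prompt caching, so every step re-sends the full history. The function name, the token counts, and the per-million-token prices are all hypothetical placeholders, not figures from the thread or from any provider.

```python
def react_loop_cost(steps, system_tokens, output_tokens_per_step,
                    observation_tokens_per_step,
                    usd_per_million_input, usd_per_million_output):
    """Estimate tokens and dollars for a naive ReAct loop (no prompt caching).

    Every step re-reads the system prompt plus all earlier thoughts, actions,
    and tool observations, so total input tokens grow roughly quadratically
    with the number of steps. All parameters are illustrative assumptions.
    """
    total_input = 0
    total_output = 0
    context = system_tokens  # tokens the model must read at the next step
    for _ in range(steps):
        total_input += context                  # full history is re-sent
        total_output += output_tokens_per_step  # thought + action text
        # The model's output and the tool observation join the history.
        context += output_tokens_per_step + observation_tokens_per_step
    cost_usd = (total_input * usd_per_million_input
                + total_output * usd_per_million_output) / 1_000_000
    return total_input, total_output, cost_usd


if __name__ == "__main__":
    # Hypothetical sizes and prices, chosen only to show the trend.
    for steps in (10, 50, 200):
        tokens_in, tokens_out, usd = react_loop_cost(
            steps=steps,
            system_tokens=2_000,
            output_tokens_per_step=500,
            observation_tokens_per_step=1_500,
            usd_per_million_input=3.0,
            usd_per_million_output=15.0,
        )
        print(f"{steps:>4} steps: {tokens_in:>12,} input tokens, "
              f"{tokens_out:>8,} output tokens, ${usd:,.2f}")
```

Under these assumptions, doubling the number of steps more than doubles the input tokens, which is one reason the missing loop lengths matter when judging the fifteen-dollar figure. Prompt caching or history truncation, where available, would change the picture.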

Sentiment

Positive users focus on advancing cheaper hardware and better tokenization to cut frontier model costs in agent loops, while negative users criticize sloppy code and incentives that encourage wasteful token burning.

Positive: 60.0%

Negative: 40.0%

5 comments with sentiment.

Posts from X

Most Activity

Views: 698 · Likes: 1 · Retweets: 1 · Replies: 1

Graham Neubig@gneubig

@yoavgo Have you tried Minimax-M2.7 (240B) or even qwen-3.6-35B? Not to mention GLM 5.2, which is larger but still presumably smaller than opus and GPT.

I think that there are plenty of options along the cost efficiency spectrum.

1h ago · Bookmarks: 1

(((ل()(ل() 'yoav))))👾@yoavgo

@Ish4Eye i think these are all different directions. i am mostly interested in the agentic stack part, but i don't see a lot of (current) work that attempts to go beyond a react loop, not in the past few months (there used to be many more)

2h ago

Yishai@Ish4Eye

@yoavgo What makes you feel alone in this? It seems to me that the entire field is focused on optimizing exactly that

2h ago

Yishai@Ish4Eye

@yoavgo i.e. building advanced os models that can run on cheaper hardware and perform on SOTA level

Or by creating new SOTA hardware

Or improving the agentic stack, memory and reasoning

This field is advancing faster than any other ever

2h ago

🏴‍☠️jake@vokaysh

@yoavgo are we getting there fast enough to disincentivize scaling this?

2h ago

(((ل()(ل() 'yoav))))👾@yoavgo

@Ish4Eye is it? where?

2h ago

(((ل()(ل() 'yoav))))👾@yoavgo

@gneubig i think my issue is a bit elsewhere: it's the "reasoning" part which appears to be wasteful and an overkill for many things. yes, with high reasoning we can now do a lot, but could we do it also with much less reasoning and a smarter control harness?

27m ago

(((ل()(ل() 'yoav))))👾@yoavgo

@vokaysh getting where, scaling what?

2h ago

david tolpin@dtolpin

Precise AGENTS.md, detailed and tested skills, task-specific extensions is what takes the cost down and decreases conversion time.

Some environments are more suitable for this than others. claude/codex/pi with a frontier model and Maya/Blender adapter pose a rigged model into sitting with one hand supporting the chin and the other scratching the thigh in 15 minutes and $20.

pi with opus 4.8 (just an example), 500 lines of AGENTS.md, 7 skills 1600 lines total, and two extensions (170 lines of typescript) do this (nothing is specific to sitting in the harness, any pose) in 3 minutes and $3.

I think this is called programming.

2h ago

Graham Neubig@gneubig

@yoavgo Interestingly we found that larger more expensive models are actually faster at solving tasks due to their sophisticated inference methods and good parallel tool calling.

1h ago

Liran Ringel@liranringel

@yoavgo Depends on the task. To cure cancer? Worth it. To optimize a CUDA kernel? I'd spend even more.

2h ago

Leo Boytsov@srchvrs

@yoavgo @Ish4Eye I am pretty sure a lot of people in the industry are working on optimizing existing agentic loops.

32m ago

(((ل()(ل() 'yoav))))👾@yoavgo

@gneubig I am also not talking about coding/dev tasks, but various search-related situations we are exploring, where the frontier models work, but are extremely wasteful in compute (and $$), compared to our expectations.

25m ago

Dan Ofer (✈️ ICML-26 Was @Worldcon)@danofer

@yoavgo You forgot godawful amounts of sloppy overlong code. (Before you make them self-reflect + self review enough).

1h ago

Ferbin@Ferbin08

@yoavgo Lots of options.

But what's your speed budget? That's usually the real constraint, not cost efficiency.

12m ago

Reef Menaged@ReefMenaged

As long as burning tokens is seen as a sign that the programmer "worked harder", model providers have little incentive to optimize reasoning efficiency, especially if it comes at the cost of benchmark performance. As a result, we'll keep seeing models burn tens of thousands of tokens on relatively simple problems.

1h ago

TechGeekDavid@techpupparent

@yoavgo Not alone on this. Tokenization is core to the problem. We compress and emit tokens at the same level regardless of information density. In a ReAct loop this waste compounds. Smaller models with better harnesses is the right direction.

1h ago