Northwestern's Zihan Wang introduces BAGEN, finding frontier LLM agents consistently fail to predict and manage their token budgets · Digg

Early stopping saves 28–64% of the tokens wasted on failed trajectories

Original post

Zihan "Zenus" Wang@wzenus · #1927 in Tech

🧵 Claude-Opus-4.8 costs you too many tokens, but is this issue general across agents? Do agents know how much they'll spend?

Introducing Budget-Aware Agents (BAGEN): We study budget awareness across 4 envs & 5 frontier agents, and find structured failures in most of them. 👇

9:27 AM · May 29, 2026 · 519.9K Views

Sentiment

Positive users praised the BAGEN study on LLM agents lacking budget awareness as valuable or a game-changer for efficient deployment, while negative users dismissed the findings on structured failures as simplistic or misattributed.

Positive: 71.4% · Negative: 28.6% (7 comments with sentiment)

Related links

BAGEN: Are LLM Agents Budget-Aware?

Via ragen-ai.github.io

Posts from X

Manling Li@ManlingLi_

Budget-aware Agents (BAGEN) study the failure modes in budget estimation:

1. Strong agents are not strong budget estimators.

2. Frontier models are often overoptimistic.

3. Budget awareness is actionable and trainable. SFT plus RL strengthens early stop and alert behavior, saving 28-64 percent of tokens on failed trajectories.

4. Upper and lower bound calibration remains hard.

https://ragen-ai.github.io/bagen/

Yohei@yoheinakajima

models underestimate how much work it takes (token usage) to accomplish a task, just like us

Jiaxin Pei@jiaxin_pei

Most real-world tasks run under a budget. Human agents know when to stop, ask for more, or change plans. But what about AI agents? Check out our new study on the budget awareness of AI agents👇

Cody Blakeney@code_star

I’m BAGEN you to stop

Leigh Drogen@LDrogen

Been struggling with this. My OpenClaw is supposed to choose which model to use for which task to limit unnecessary spend, but I get the feeling, as this paper suggests, that it does a poor job.

Andrey Fradkin@AndreyFradkin

@wzenus Cool stuff! You may want to cite our earlier work. https://arxiv.org/abs/2604.23897

Yohei@yoheinakajima

@wzenus cool study. would be very helpful if they were great at estimating required token budget

Jinyan Su@SuJinyan6

Humans have the natural instinct to do constrained optimization based on the resources available, how about agents?

Zihan "Zenus" Wang@wzenus

🏠 Project Page: http://ragen-ai.github.io/bagen 📄 Paper: http://ragen-ai.github.io/bagen/bagen.pdf 💻 Code: http://github.com/mll-lab-nu/BAGEN 📊 Data: http://huggingface.co/datasets/MLL-Lab/BAGEN

Zihan "Zenus" Wang@wzenus

First of all, what is budget awareness? A budget-aware agent (BAGEN) doesn't just "spend less", but estimates remaining budget mid-task with uncertainty.

We formalize this with progressive interval estimation: 📌 Can agents estimate the remaining budget mid-task? 📌 Can agents express their uncertainty as intervals [lo, hi]?
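A minimal sketch of how interval coverage could be scored under this framing. The thread does not give the paper's exact definition, so this assumes that at each step the agent emits an interval [lo, hi] for its remaining tokens, and that coverage is the fraction of intervals containing the true remaining amount. All names (`IntervalEstimate`, `interval_coverage`) and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IntervalEstimate:
    step: int    # agent step at which the estimate was made
    lo: float    # lower bound on remaining tokens
    hi: float    # upper bound on remaining tokens


def interval_coverage(estimates, tokens_used_by_step, total_tokens):
    """Fraction of mid-task intervals that contain the true remaining budget.

    Assumption: the true remaining budget at a step is the total tokens the
    finished trajectory consumed minus the tokens consumed up to that step.
    """
    hits = 0
    for est in estimates:
        true_remaining = total_tokens - tokens_used_by_step[est.step]
        hits += est.lo <= true_remaining <= est.hi
    return hits / len(estimates)


# Made-up trajectory: 1000 tokens in total, estimates taken at steps 0, 1, 2.
estimates = [
    IntervalEstimate(step=0, lo=800, hi=1200),  # true remaining 1000: covered
    IntervalEstimate(step=1, lo=300, hi=500),   # true remaining 600: too low
    IntervalEstimate(step=2, lo=100, hi=200),   # true remaining 100: covered
]
tokens_used_by_step = [0, 400, 900]
coverage = interval_coverage(estimates, tokens_used_by_step, total_tokens=1000)
```

Here two of the three intervals contain the true value, so `coverage` is 2/3. The missed interval sits below the truth, which is the optimistic direction.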

Zihan "Zenus" Wang@wzenus

There is a lot more to be done with Budget-Aware Agents (BAGENs). After doing part of the work and realizing it can't finish within budget, a BAGEN should:

① Ask for more budget early ② Cut losses and switch to another task ③ Hand off to a stronger agent

These motivate lots of directions for future work!

Zihan "Zenus" Wang@wzenus

What we found: 1/ Budget awareness ≠ task performance (r ≈ 0.35).

The best-performing model is NOT the best budget estimator.

On SWE-bench: Opus leads task success, Gemini leads feasibility F₁, GPT-5.2 leads interval coverage. Three different winners, three different capabilities.
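For readers unfamiliar with the metric, here is a rough sketch of a feasibility F₁ computation. It assumes feasibility is a binary prediction ("can this task finish within budget?") scored against what actually happened, with "feasible" as the positive class. The thread does not spell out the paper's exact setup, and the data below is invented for illustration.

```python
def feasibility_f1(predicted, actual):
    """F1 score for binary feasibility predictions.

    predicted[i]: the agent said the task was feasible within budget.
    actual[i]:    the task really did finish within budget.
    Assumption: "feasible" is treated as the positive class.
    """
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up predictions: one overoptimistic call, one missed feasible task.
predicted = [True, True, True, False]
actual = [True, False, True, True]
f1 = feasibility_f1(predicted, actual)
```

With two true positives, one false positive and one false negative, precision and recall are both 2/3, so `f1` is 2/3 as well. A model can score well here while still solving fewer tasks, which is the distinction the post draws.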

Zihan "Zenus" Wang@wzenus

@AndreyFradkin Great work. Thanks for your recommendation!

Zihan "Zenus" Wang@wzenus

3/ Failure is recognized too late to act on.

Models predict "feasible" above 70% even after 60% of budget is consumed. The alarm fires only in the final 20%.

The good news: stopping once the model predicts "impossible" saves 28–64% of wasted tokens at a success cost of only 1.6–4.2%.
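The trade-off above can be sketched as a replay over logged trajectories. This is an illustrative reconstruction, not the paper's evaluation code. It assumes the rule is "halt at the first step where the agent predicts the task is infeasible", that wasted tokens are those spent on failed trajectories, and that any successful run the rule would have halted counts as lost. The names and the toy data are hypothetical and do not reproduce the 28–64% figures.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    cumulative_tokens: list[int]  # tokens spent up to and including each step
    feasible_pred: list[bool]     # agent's feasibility prediction at each step
    succeeded: bool


def stop_index(traj):
    """Index of the first step where the agent predicts 'impossible', if any."""
    for i, feasible in enumerate(traj.feasible_pred):
        if not feasible:
            return i
    return None


def evaluate_early_stop(trajs):
    """Return (fraction of wasted tokens saved, fraction of successes lost)."""
    wasted = saved = successes = lost = 0
    for t in trajs:
        total = t.cumulative_tokens[-1]
        i = stop_index(t)
        if t.succeeded:
            successes += 1
            lost += i is not None  # the rule would have cut a run that worked
        else:
            wasted += total
            if i is not None:
                saved += total - t.cumulative_tokens[i]
    saved_frac = saved / wasted if wasted else 0.0
    success_cost = lost / successes if successes else 0.0
    return saved_frac, success_cost


trajs = [
    Trajectory([100, 300, 600, 1000], [True, True, False, False], succeeded=False),
    Trajectory([200, 500], [True, True], succeeded=False),
    Trajectory([100, 200], [True, True], succeeded=True),
    Trajectory([100, 400], [False, True], succeeded=True),
]
saved_frac, success_cost = evaluate_early_stop(trajs)
```

In this toy set the rule saves 400 of 1500 wasted tokens and would have halted one of the two successful runs. How late the alarm fires is what caps the savings: the first failed run only saves tokens spent after step 2.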

Zihan "Zenus" Wang@wzenus

2/ All models are universally too optimistic.

Most of the 20 model-task pairs underestimate remaining budget. Weaker models are MORE optimistic.

The bias doesn't shrink with task progress.
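One way to check a claim like this on your own logs is to bucket estimates by task progress and compare the signed error per bucket. The sketch below assumes a point estimate of remaining tokens is available at each step, and uses signed relative error, where negative values mean the agent underestimated (optimism). The function names, bucket scheme and data are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def signed_relative_error(predicted_remaining, actual_remaining):
    """Negative means the agent underestimated what was left (optimism).

    Assumes actual_remaining > 0.
    """
    return (predicted_remaining - actual_remaining) / actual_remaining


def bias_by_progress(records, n_buckets=4):
    """Mean signed relative error per progress bucket.

    records: (progress in [0, 1], predicted_remaining, actual_remaining).
    """
    buckets = defaultdict(list)
    for progress, predicted, actual in records:
        b = min(int(progress * n_buckets), n_buckets - 1)
        buckets[b].append(signed_relative_error(predicted, actual))
    return {b: sum(errs) / len(errs) for b, errs in sorted(buckets.items())}


# Made-up estimates at 10%, 20%, 60% and 90% progress.
records = [
    (0.1, 500, 1000),
    (0.2, 600, 800),
    (0.6, 200, 400),
    (0.9, 50, 100),
]
bias = bias_by_progress(records)
```

In this invented example every bucket has a negative mean and the error is no smaller late in the task, which is the pattern the post describes.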

Zihan "Zenus" Wang@wzenus

4/ Budget awareness is trainable, but hits a ceiling.

SFT alone raises Qwen-7B feasibility 25.5% → ~90% (calibration problem ✅).

Interval coverage caps near 47% after SFT+RL. Half of intervals still miss the true remaining budget (reasoning problem ❌).

Doğaç@dogacel0

I found time estimation on coding especially overly conservative. I ask how long a refactoring would take, then the agent says 1.5 days, and we usually finish it within 1-2 hours.

I think there is some bias coming from the training data considering it could really take around 1.5 days if AI didn’t exist.

Zihan "Zenus" Wang@wzenus

@yoheinakajima Yes! Budget-awareness would be a missing ability that people should hillclimb :)

Henry Zhang@thehenryinsf

@wzenus do they fail because they cannot price actions, or because they keep pretending the plan still works

JPHZ@juanpab52171869

@wzenus @grok explain this to me
