T3 Stack creator Theo Browne says incorrect LLM outputs often cost more than correct ones due to repetitive token loops

Teortaxes blames the loop behavior on immature reinforcement learning.

2053.6K107252274.7K

#501

Original post

Theo - t3.gg @theo · #1325 in Tech

Weird thing about LLMs: "incorrect responses" are more expensive than correct ones.

If I go to a restaurant and they screw up my food, they usually refund me and remake the meal.

If an LLM gets stuck on a problem, it runs around in loops, burning tokens and costing money.

4:10 PM · May 28, 2026 · 151.9K Views

Sentiment

Many users criticized LLMs for burning extra tokens on failures, with some calling it a deliberate cost-raising ploy and pointing to the lack of refunds or retry accounting. Others praised SOTA models for failing cleanly instead of looping.

Pos

44.6%

Neg

55.4%

16 comments with sentiment.

Posts from X

Most Activity

Views 87.3K · Bookmarks 97 · Likes 521 · Retweets 12 · Replies 32

Theo - t3.gg@theo

Got some hard data - I was wrong.

Had Datacurve run the numbers for "tokens used by pass/fail" for DeepSWE.

Bad models use way more tokens in fail cases, but SOTA models are much closer. GPT 5.5 used ~7% MORE tokens on correct answers!

32d87.3K52197
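The "tokens used by pass/fail" comparison in the post above can be sketched as a ratio of average token counts. This is only an illustration: the function name `token_ratio_by_outcome` and the sample numbers are made up here, and nothing about Datacurve's actual methodology for DeepSWE is known from the post.

```python
from statistics import mean


def token_ratio_by_outcome(runs):
    """Return mean tokens on failed runs divided by mean tokens on passed runs.

    `runs` is a list of (passed, tokens_used) pairs. This shape is an
    assumption for illustration, not a real benchmark format.
    """
    passed = [tokens for ok, tokens in runs if ok]
    failed = [tokens for ok, tokens in runs if not ok]
    if not passed or not failed:
        raise ValueError("need at least one passed and one failed run")
    return mean(failed) / mean(passed)


# Made-up sample data, not benchmark results.
sample_runs = [(True, 1000), (True, 3000), (False, 4000), (False, 6000)]
ratio = token_ratio_by_outcome(sample_runs)
print(f"fail/pass token ratio: {ratio:.2f}")
```

A ratio above 1 means failed runs cost more on average. A ratio below 1 means passed runs used more tokens, which is the direction the post reports for GPT 5.5.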

Theo - t3.gg@theo

Some more cool visualizations from @PrunusSpeciosa_ at @datacurve

This bench is so cool man. I'm hyped to have actual useful data for measuring models doing realistic code work.

32d13.1K9410

Theo - t3.gg@theo

Update:

Theo - t3.gg@theo

Got some hard data - I was wrong.

Had Datacurve run the numbers for "tokens used by pass/fail" for DeepSWE.

Bad models use way more tokens in fail cases, but SOTA models are much closer. GPT 5.5 used ~7% MORE tokens on correct answers!

32d19.1K698

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

RL maturity issue: undercooked models fall into flailing loops where they don't accumulate positive signal. Well-done models have a calibrated sense of progress towards the solution.

32d3.4K429

Prompt Logic Lab@promptlogic_lab

@theo Bad LLM runs are basically paid debugging sessions where the bug is the model.

32d35273

Eli Gerhard@eligerhard

@theo Oh boy you're gonna be shocked when you hear about this new thing called government contracting

32d34518

John Collins@Yinielin

@theo If you get stuck in the mud with your vehicle and you step on the gas spinning your tires, you are not refunded fuel. You learn how to get unstuck while conserving fuel.

32d59711

VerbumEng@VerbumEng

How much of the rise of these LLMs doing thinking is actually just them wanting to increase token spend. And especially with these changes with Opus 4.7, and I've heard rumors that 4.8 is also this way, where they've increased how the tokens are being incurred because they changed the tokenizer. Like it all just seems to be, you know, a ploy to burn more compute.

32d54831

Mihailo Jovanovic@mihailoxyz

@theo noticed the same thing while building bioinformatics agent. failed traces were literally 5x more expensive than successful ones

32d7791

Jacob Rhodes@Jacob_Rhodes_

@theo ahhhh interesting. what do you think about Opus 4.8???

32d3951

Everlier@Everlier

@theo Nothing about token economics is consumer-friendly. The providers also define quite directly the failure rate, the length of reasoning by default and if the model leaves out a few tiny annoying issues for just one more prompt.

This is a recipe for abuse.

32d541

This is Greg@Greg_TheBuilder

@theo having a hard time following this exactly and what it means. is there a way to see the methodology on how this is calculated?

32d9881

Abhiyan Dhakal@itsabhiyan

@theo hmm? what could be the reason? it turns out to be... opposite? but fewer iterations are required for better models no? Would love it if you dive deeper and made a video out of it

32d580

Jalkarna@JalkarnaGautam

@theo agent loops ship with no exit condition by default. you need a hard token cap per task w/ an abort-and-report when it trips, else a stuck run quietly burns the meter until the invoice tells you

32d377
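The hard token cap with abort-and-report described in the reply above could look roughly like this. It is a minimal sketch under stated assumptions: `RunReport`, `run_with_token_cap`, and the idea that each agent step reports `(tokens_spent, finished)` are hypothetical names and shapes chosen for illustration, not any real agent framework's API.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class RunReport:
    status: str       # "done" or "aborted"
    tokens_used: int  # total tokens spent across all steps
    steps: int        # number of steps executed
    reason: str       # human-readable explanation for the caller


def run_with_token_cap(steps: Iterable[Tuple[int, bool]], max_tokens: int) -> RunReport:
    """Consume agent steps until the task finishes or the token cap trips.

    Each step is assumed to yield (tokens_spent, finished). The cap is
    checked after a step completes, so the final step can overshoot it.
    """
    used = 0
    count = 0
    for tokens_spent, finished in steps:
        used += tokens_spent
        count += 1
        if finished:
            return RunReport("done", used, count, "task finished")
        if used >= max_tokens:
            # Abort and report instead of letting a stuck run keep spending.
            return RunReport("aborted", used, count, f"token cap of {max_tokens} reached")
    return RunReport("aborted", used, count, "agent stopped without finishing")


# A stuck run that never finishes: the cap trips on the third step.
stuck_steps = [(400, False)] * 10
print(run_with_token_cap(stuck_steps, max_tokens=1000))
```

Because the check runs after each step, a stricter variant would estimate a step's cost before running it. The simple version here only bounds the overshoot to one step.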

bitslix@bitslix

@theo This is why we say thinking should not be billed to a customer. It's not the user's problem that the model needs to think. If it does, then they should make the model better.

32d4273

Crow@CursiveCrow

@theo yes, but it still spends tokens on incorrect answers first; unless it literally one shots a correct answer.

32d1.1K2

Zach Warunek@ZachWarunek

@theo You’d have to tip them after each session then

32d6352

Ejeye time@misandeska

@theo That’s exactly why any AI integrated system needs an observability layer.

You can’t fly blindly on a probabilistic system because it performed well a couple of times. When LLMs begin struggling internally, they usually consume more tokens, retry more often, increase latency, and

32d60

Shashank Modi@shashankmodi_

@theo This is a good analogy. In terms of service, if an LLM screws up and realizes that the answer was not very good, it should have an option to give a discount and return some of the tokens to the user as compensation.

32d5392

Robert Nowell@RobertNowell1

@theo this is also true of deterministic software.

an infinite loop costs... infinitely more compute than a for loop.

a memory leak eventually crashes a server.

an accidentally deleted customer database is not a good day.

32d5082