T3 Stack creator Theo Browne says incorrect LLM outputs often cost more than correct ones due to repetitive token loops
Teortaxes blames the loop behavior on immature reinforcement learning.
RL maturity issue: undercooked models fall into flailing loops where they don't accumulate positive signal. Well-done models have a calibrated sense of progress towards the solution.
Update:
Got some hard data - I was wrong. Had Datacurve run the numbers for "tokens used by pass/fail" for DeepSWE. Bad models use way more tokens in fail cases, but SOTA models are much closer. GPT 5.5 used ~7% MORE tokens on correct answers!
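For anyone who wants to run the same kind of "tokens used by pass/fail" comparison on their own runs, here is a minimal sketch. It assumes each run is a record with a `passed` flag and a `tokens` count, and it compares plain means. The field names, the `tokens_by_outcome` helper, and the sample numbers are all hypothetical; the post does not say how Datacurve actually aggregated the DeepSWE data (mean vs. median, per task or pooled).

```python
from statistics import mean


def tokens_by_outcome(runs):
    """Return (mean tokens on passes, mean tokens on fails, relative gap).

    A positive gap means passing runs used more tokens than failing ones.
    """
    passed = [r["tokens"] for r in runs if r["passed"]]
    failed = [r["tokens"] for r in runs if not r["passed"]]
    if not passed or not failed:
        raise ValueError("need at least one passing and one failing run")
    mean_pass, mean_fail = mean(passed), mean(failed)
    return mean_pass, mean_fail, (mean_pass - mean_fail) / mean_fail


# Made-up numbers for illustration only, not DeepSWE data.
runs = [
    {"passed": True, "tokens": 10_700},
    {"passed": True, "tokens": 10_700},
    {"passed": False, "tokens": 10_000},
]

mean_pass, mean_fail, gap = tokens_by_outcome(runs)
print(f"pass: {mean_pass}, fail: {mean_fail}, gap: {gap:+.1%}")
```

A strongly negative gap would match the "bad models burn tokens on failures" pattern, while a small positive gap would look like the ~7% figure quoted for GPT 5.5.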

Weird thing about LLMs: "incorrect responses" are more expensive than correct ones. If I go to a restaurant and they screw up my food, they usually refund me and remake the meal. If an LLM gets stuck on a problem, it runs around in loops, burning tokens and costing money.
Some more cool visualizations from @PrunusSpeciosa_ at @datacurve
This bench is so cool man. I'm hyped to have actual useful data for measuring models doing realistic code work.

