18h ago

T3 Stack creator Theo Browne says incorrect LLM outputs often cost more than correct ones due to repetitive token loops

Teortaxes blames the loop behavior on immature reinforcement learning.

Original post

Weird thing about LLMs: "incorrect responses" are more expensive than correct ones. If I go to a restaurant and they screw up my food, they usually refund me and remake the meal. If an LLM gets stuck on a problem, it runs around in loops, burning tokens and costing money.

4:10 PM · May 28, 2026

RL maturity issue: undercooked models fall into flailing loops where they don't accumulate positive signal. Well-done models have a calibrated sense of progress towards the solution.

Theo - t3.gg (@theo)

Got some hard data - I was wrong. Had Datacurve run the numbers for "tokens used by pass/fail" for DeepSWE. Bad models use way more tokens in fail cases, but SOTA models are much closer. GPT 5.5 used ~7% MORE tokens on correct answers!

12:16 AM · May 29, 2026 · 80.1K Views
12:21 AM · May 29, 2026 · 3.2K Views

Update:

Got some hard data - I was wrong.

Had Datacurve run the numbers for "tokens used by pass/fail" for DeepSWE.

Bad models use way more tokens in fail cases, but SOTA models are much closer. GPT 5.5 used ~7% MORE tokens on correct answers!

Theo - t3.gg (@theo)

Weird thing about LLMs: "incorrect responses" are more expensive than correct ones. If I go to a restaurant and they screw up my food, they usually refund me and remake the meal. If an LLM gets stuck on a problem, it runs around in loops, burning tokens and costing money.

11:10 PM · May 28, 2026 · 135.3K Views
12:16 AM · May 29, 2026 · 80.1K Views
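For readers who want to run a similar "tokens used by pass/fail" comparison on their own evaluation logs, here is a minimal Python sketch. It is not Datacurve's method: the record layout, the model names, the token counts, and the choice of the mean as the aggregate are all assumptions made for illustration, and none of the numbers are DeepSWE data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records: (model, passed, tokens_used).
# Illustrative values only; these are NOT DeepSWE results.
runs = [
    ("model_a", True, 1000),
    ("model_a", True, 1200),
    ("model_a", False, 3000),
    ("model_a", False, 3600),
    ("model_b", True, 2140),
    ("model_b", False, 2000),
]


def pass_fail_token_gap(records):
    """Return {model: percent by which mean tokens on passes exceed mean tokens on fails}.

    A negative value means failed runs used more tokens than passing runs.
    Models that lack either a pass or a fail are skipped.
    """
    buckets = defaultdict(lambda: {True: [], False: []})
    for model, passed, tokens in records:
        buckets[model][passed].append(tokens)

    gaps = {}
    for model, by_outcome in buckets.items():
        if not by_outcome[True] or not by_outcome[False]:
            continue  # need both outcomes to make a comparison
        mean_pass = mean(by_outcome[True])
        mean_fail = mean(by_outcome[False])
        gaps[model] = 100 * (mean_pass - mean_fail) / mean_fail
    return gaps


if __name__ == "__main__":
    for model, gap in pass_fail_token_gap(runs).items():
        print(f"{model}: {gap:+.1f}% tokens on passes vs. fails")
```

In this made-up data, model_a burns far more tokens when it fails, while model_b spends slightly more on passes. The mean is sensitive to a few runaway runs, so a median or per-task pairing may give a different picture.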

Some more cool visualizations from @PrunusSpeciosa_ at @datacurve

This bench is so cool man. I'm hyped to have actual useful data for measuring models doing realistic code work.

12:17 AM · May 29, 2026 · 12.3K Views