OpenAI internal optimization cuts inference costs by half, running logged-out ChatGPT traffic on a couple hundred GPUs · Digg

Story Overview

OpenAI quietly found an optimization that halves inference costs for the models it is applied to. The first visible payoff came in logged-out ChatGPT traffic, where the number of GPUs needed dropped to a couple hundred. The work came from engineers squeezing more out of existing servers rather than chasing new chips, a detail that stayed internal until The Information surfaced it.

Original post

Andrew Curran@AndrewCurran_

OpenAI has found a way to cut inference costs in half.

Stephanie Palazzolo@steph_palazzolo

OpenAI engineers earlier this month developed an optimization that cut inference costs in half for models it was applied to.

After the optimization was applied to logged-out ChatGPT traffic, it reduced the number of GPUs needed to power that traffic to a couple hundred.

8:05 AM · Jun 30, 2026 · 1.9K Views

Cost Pressure

Fewer chips, same answers

So far the only reported deployment is on logged-out ChatGPT traffic, leaving open how much headroom remains for the rest of the service or other products.

Open Question

The quiet race nobody tweets about

Anthropic and Google are chasing the same server-level gains, yet no public benchmarks or code have appeared, so the exact trick and its broader applicability stay unknown for now.

Sentiment

Positive users call OpenAI's optimizations that halve ChatGPT inference costs a major efficiency leap, while negative users blame the changes for recent drops in model quality.

Positive: 68.7%

Negative: 31.3%

12 comments with sentiment.

Related links

OpenAI Discovers New Way to Cut Inference Costs in Half (via The Information)

Posts from X

Chubby♨️@kimmonismus

OpenAI reportedly found new inference optimizations that more than halved the cost of running its models!

According to The Information, engineers told colleagues this month that the techniques helped power ChatGPT for visitors without free or paid accounts using only a couple hundred Nvidia GPUs at one point.

The exact method is unclear. It could involve quantization, KV caching, batching, routing simpler queries to cheaper models, or some mix of all of those.

The business angle is bigger than the technical detail: OpenAI ended Q1 with a 39% gross margin and wants to reach 52% by year-end. Lower inference costs give it room to either improve margins, raise ChatGPT usage limits, or cut API pricing pressure on developers.

OpenAI's moat is increasingly becoming inference and cost advantage, especially against Anthropic.
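The margin figures quoted in the post above (39% gross margin at the end of Q1, 52% targeted by year-end) can be turned into a rough back-of-the-envelope check. The sketch below assumes flat revenue and treats inference's share of cost of revenue as an unknown, hypothetical input, since no such breakdown is reported; the function names are illustrative only.

```python
def required_cost_cut(margin_now: float, margin_target: float) -> float:
    """Fractional cut in cost of revenue needed to move from one gross
    margin to another, ASSUMING revenue stays flat."""
    cost_now = 1.0 - margin_now        # cost of revenue as a share of revenue
    cost_target = 1.0 - margin_target
    return (cost_now - cost_target) / cost_now


def margin_after_inference_cut(margin_now: float,
                               inference_share: float,
                               inference_cut: float) -> float:
    """Gross margin after cutting inference cost by `inference_cut`.

    `inference_share` is the HYPOTHETICAL fraction of cost of revenue
    that is inference; the source does not report this number.
    """
    cost_now = 1.0 - margin_now
    cost_after = cost_now * (1.0 - inference_share * inference_cut)
    return 1.0 - cost_after


if __name__ == "__main__":
    cut = required_cost_cut(0.39, 0.52)
    print(f"Cost-of-revenue cut needed for 39% -> 52%: {cut:.1%}")

    # Illustrative shares only; a 50% inference cut applied everywhere.
    for share in (0.25, 0.50, 0.75):
        margin = margin_after_inference_cut(0.39, share, 0.50)
        print(f"inference share {share:.0%} -> gross margin {margin:.1%}")
```

Under these assumptions, cost of revenue would need to fall by roughly a fifth to reach the target, so a halving of inference cost would close the gap on its own only if inference made up a bit over two fifths of cost of revenue and the optimization applied to all of it. The reporting says it has so far been applied to some models only.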

Lisan al Gaib@scaling01

yet they can't cut costs for users in half

intelligence too cheap to meter has died long ago when they saw their first billion

Stephanie Palazzolo@steph_palazzolo

More on the optimization + what it could mean for OpenAI's gross margins or usage limits here:

https://www.theinformation.com/newsletters/ai-agenda/openai-discovers-new-way-cut-inference-costs-half

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

wow did they discover speculative decoding or something? margins go up again!

Nathan Lambert@natolambert

@AndrewCurran_ probably under a narrow-ish set of circumstances or something, and then it gets reported like this

Solipsnitsyn@solipsnitsyn

@steph_palazzolo ohh so that's why 5.5 became so catastrophically stupid

Michael@MichaelStolarz

@steph_palazzolo worded a bit poorly

>visitors who didn't have a free or paid account

what kind of account type is left then? i'm guessing grant/research/trial? would be better to know certainly instead of inferring

🍓🍓🍓@iruletheworldmo

@AndrewCurran_ do you know if this is part of the reported price wars?

Loquitur Ponte Sublicio@loquitur_ponte

@AndrewCurran_ How we train and run AI is probably really really inefficient given the make it up as we went process.

Lot of gains to come on running the existing physical tech better / algorithmic improvements alone...

cheaty@cheatyyyy

@AndrewCurran_ awfully convenient timing is all i'm going to say, not doubting OpenAI at all but this is a hilarious coincidence

what better way to cut inference costs in half than to double throughput

Lisan al Gaib@scaling01

@RitsFur you mean like charging 30$ for a model that costs like 2$ to serve?

The Hero of KVcache@HeroOfKVcache

@steph_palazzolo >it's quantization again

StolenAngel@MoisasADR

@AndrewCurran_ Did they explain how? My concern is that they might label some form of system quantification as "optimization."

Paweł J Lisowski@PawelJLisowski

@kimmonismus Getting harder and harder for anthropic to justify those prices. I dont think AI industry altogether is bubble, but both anthropic and openai both are starting to feel like it.

theo@crthpl_

@MichaelStolarz @steph_palazzolo no, people who are not logged in at all who just go to the website

Jessica Lessin@Jessicalessin

Um this seems big. @steph_palazzolo

https://www.theinformation.com/articles/openai-discovers-new-way-cut-inference-costs-half?utm_source=ti_app&rc=hwneun

🍓🍓🍓@iruletheworldmo

@steph_palazzolo nice

Chubby♨️@kimmonismus

https://www.theinformation.com/newsletters/ai-agenda/openai-discovers-new-way-cut-inference-costs-half?rc=bfliih

Andrew Curran@AndrewCurran_

@natolambert Yes, need much more information.

Chillguy@RitsFur

@scaling01 They're the only "frontier" lab thats working actively to improve cost vs performance. google is just struggling to keep up while Anthropic is making their models expensive as hell so they can get their fat margins.
