OpenAI reportedly found new inference optimizations that more than halved the cost of running its models!
According to The Information, engineers told colleagues this month that the techniques let OpenAI serve ChatGPT to logged-out visitors (those using it without a free or paid account) on only a couple hundred Nvidia GPUs at one point.
The exact method is unclear. It could involve quantization, KV caching, batching, routing simpler queries to cheaper models, or some mix of these.
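To make one of those candidates concrete, here is a toy sketch of query routing. It is not OpenAI's method, which has not been disclosed. The model names, prices, keyword list, and 50-word threshold are all made up for illustration.

```python
# Illustrative only: model names, prices, and the routing rule are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_1k_tokens: float  # made-up price in dollars


SMALL = Model("small-model", 0.10)
LARGE = Model("large-model", 1.00)

# Hypothetical keywords that suggest a prompt needs the stronger model.
HARD_HINTS = ("prove", "debug", "step by step", "analyze")


def route(prompt: str) -> Model:
    """Send short prompts with no 'hard' keywords to the cheaper model."""
    text = prompt.lower()
    looks_hard = len(text.split()) > 50 or any(h in text for h in HARD_HINTS)
    return LARGE if looks_hard else SMALL


def blended_cost(prompts: list[str], tokens_per_reply: int = 1000) -> float:
    """Total cost of answering every prompt with its routed model."""
    return sum(
        route(p).cost_per_1k_tokens * tokens_per_reply / 1000 for p in prompts
    )


prompts = [
    "What is the capital of France?",
    "Debug this stack trace step by step",
    "Translate 'hello' to Spanish",
]
routed = blended_cost(prompts)
all_large = LARGE.cost_per_1k_tokens * len(prompts)
print(f"routed: ${routed:.2f} vs all-large: ${all_large:.2f}")
```

With these invented prices, two of the three prompts go to the cheap model, so the blended bill is well under the all-large baseline. A production router would more likely use a learned classifier than a keyword list.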
The business angle is bigger than the technical detail: OpenAI ended Q1 with a 39% gross margin and wants to reach 52% by year-end. Lower inference costs give it room to improve margins, raise ChatGPT usage limits, or cut API prices for developers.
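A back-of-the-envelope check on those margin figures, as a rough sketch. It assumes revenue stays constant and treats cost of revenue as a single bucket. The `inference_share` parameter (the fraction of cost of revenue that is inference) is hypothetical, since OpenAI's actual cost breakdown is not public.

```python
def cost_share(gross_margin: float) -> float:
    """Cost of revenue as a fraction of revenue: 1 - gross margin."""
    return 1.0 - gross_margin


def margin_after_cut(gross_margin: float, inference_share: float, cut: float) -> float:
    """Gross margin after cutting the inference part of cost of revenue.

    inference_share: hypothetical fraction of cost of revenue that is inference.
    cut: fractional reduction in inference cost (0.5 means halved).
    """
    new_cost = cost_share(gross_margin) * (1.0 - inference_share * cut)
    return 1.0 - new_cost


current = cost_share(0.39)  # 61 cents of cost per dollar of revenue
target = cost_share(0.52)   # 48 cents of cost per dollar of revenue
required_cut = (current - target) / current
print(f"cost of revenue must fall by {required_cut:.1%}")  # 21.3%

# Upper bound: if ALL cost of revenue were inference and it were halved.
print(f"{margin_after_cut(0.39, inference_share=1.0, cut=0.5):.1%}")  # 69.5%
```

Under these assumptions, going from 39% to 52% takes roughly a 21% drop in cost per dollar of revenue. A halving of inference cost would exceed that even if inference were only part of the total, which is what leaves room for higher usage limits or lower prices.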
OpenAI's moat is increasingly its inference efficiency and the cost advantage that comes with it, especially against Anthropic.