this is going to be the norm
Coinbase cuts AI spending by nearly 50% during exponential token growth using query caching and open-weight model routing
AI Judge changed title after evaluation, original title: "Coinbase cuts its internal AI costs by 50% using prompt routing and aggressive caching"
Story Overview
Coinbase trimmed its internal AI bill by half even while token usage climbed sharply, leaning on prompt routing to cheaper open-weight models plus aggressive caching that boosted hit rates from 5 percent to 60 percent without imposing any usage caps or friction for most staff.
Defaults and routing flatten spend curves
By setting cheaper models as the starting point and preprocessing prompts to check cache hits and pricing first, the approach keeps overall costs down while letting usage grow.
No throttling for the majority of users
Ninety-one percent of employees saw no change in access, showing the gains came from infrastructure tweaks rather than limits on who can query the tools.
Many users praised Coinbase's AI spend cuts via smarter defaults, caching and open-weight models as smart and a win for open source, while some criticized Chinese models as inferior or blamed US labs for overpricing.
Most Activity
Some good best practices here on AI token cost optimization. None of these happens though without a deep understanding of the underlying work being done in a non-abstract way.
The ultimate implication is that a layer between the work itself and the underlying intelligence needs to deeply understand your workflows, context, and business process. Now, each individual company doing this on their own is unlikely to be effective at scale, so as a consequence, this is effectively the playbook for any applied AI company right now.
Evaluating the models for the applied use cases, deeply understanding the domain, tuning UX and features for the use case, and being able to support adoption and change (via FDEs) all allow this layer to add a ton of value. And as a result, enterprises get higher ROI because you actually can get *more* intelligence per dollar by having optimal architecture and workflows.
There will be many horizontal and vertical versions of this approach. Huge opportunity right now.
How to keep AI spend flat while token usage grows exponentially: Not with friction and spend alerts. With better defaults, routing, and caching.
Better Defaults (not Usage Caps) – Engineers can choose any model they want, but defaults matter. We’re experimenting with defaulting to open weight models like GLM 5.2 and Kimi 2.7 through our LLM gateway, while still encouraging engineers to choose the right model for the task. 91% of our employees were never hitting their usage caps, so instead of lowering caps and driving up alerts, we're moving to cheaper defaults. Note that code reviews use a diversity of models, so they can check each other's work.
Better Routing – In our custom harnesses, we preprocess prompts and route to the best model for the job, considering cache hits and model pricing. For instance, you may want a frontier model for planning, but not for execution, where it can be overkill. Ultimately, humans shouldn't be choosing models - AI can automate this task.
Better Caching – Cache misses are the easiest way to drive your cost up. All of our requests are cache aware, so we’re reusing a warm cache wherever possible. For example, our cache hit rate went from 5% → 60% in LibreChat once properly implemented.
Keep Context Lean – Start fresh sessions when switching tasks. Scope file context narrowly. Disconnect unused tools. Don't just compact. The goal isn't fewer tokens used, it's fewer tokens wasted.
Better Visibility – Our engineers can use as many tokens as they want, from whatever model they want, but we’ve made usage visible – and the more you spend on AI, the more impact we expect.
The goal isn't to suppress usage. It's to build the infrastructure that makes exponential growth sustainable.
Putting this into practice has cut our AI spend nearly in half, while our token usage continues to grow.
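To see why the cache hit rate matters so much, here is a rough sketch of the arithmetic. Only the 5% and 60% hit rates come from the post; the token volume, the price per million tokens, and the discount on cached tokens are made-up placeholders, not Coinbase's or any provider's actual rates.

```python
def blended_input_cost(tokens: int, hit_rate: float,
                       price_per_mtok: float, cached_discount: float) -> float:
    """Dollar cost of `tokens` input tokens when a fraction `hit_rate` hits the cache.

    cached_discount is the fraction of the full price charged for cached tokens
    (hypothetical: 0.1 means cached tokens cost 10% of the normal rate).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    full_price = price_per_mtok / 1_000_000
    cached = tokens * hit_rate * full_price * cached_discount
    uncached = tokens * (1.0 - hit_rate) * full_price
    return cached + uncached


# Hypothetical workload: 1B input tokens at $3 per million, cached tokens at 10% of price.
before = blended_input_cost(1_000_000_000, 0.05, 3.0, 0.1)
after = blended_input_cost(1_000_000_000, 0.60, 3.0, 0.1)
print(f"5% hit rate:  ${before:,.2f}")
print(f"60% hit rate: ${after:,.2f}")
print(f"saving: {1 - after / before:.0%}")
```

Under these assumed numbers the input bill drops by roughly half from caching alone, before any change of model. The real effect depends on how a given provider prices cached tokens.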
I said this was going to happen. Chinese open source is looking quite attractive as token spend keeps increasing. The vast majority of use cases don’t require the absolute frontier of intelligence.
This will become a big problem. If US enterprise is built on Chinese models, we will become reliant on Chinese chips as the chip<>model codesign becomes more codependent.
I'll leave this here:
If you want to know how to achieve this, talk to @tomas_hk
This is what his incredible team is building at NotDiamond.

@levie This is the layer I’m building around with @TracerML
The key primitive is the workflow trace: task, context, model output, correction, and outcome.
Once you have that, routing stops being generic model selection and becomes workflow-specific intelligence allocation.
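One way to picture that primitive is as a small record type. This is only a guess at the shape: the fields follow the five items listed above, while the extra `model` field, the types, and the `correction_rate` helper are hypothetical and not anything TracerML has published.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class WorkflowTrace:
    task: str                  # e.g. "summarize_ticket"
    context: str               # what the model was shown
    model: str                 # assumed extra field: which model answered
    model_output: str
    correction: Optional[str]  # human fix, or None if accepted as is
    outcome: str               # e.g. "accepted" or "rejected"


def correction_rate(traces: Iterable[WorkflowTrace], task: str, model: str) -> Optional[float]:
    """Fraction of traces for (task, model) that needed a human correction."""
    relevant = [t for t in traces if t.task == task and t.model == model]
    if not relevant:
        return None  # no evidence yet for this pairing
    corrected = sum(1 for t in relevant if t.correction is not None)
    return corrected / len(relevant)


traces = [
    WorkflowTrace("summarize_ticket", "ticket text", "cheap-model", "draft A", None, "accepted"),
    WorkflowTrace("summarize_ticket", "ticket text", "cheap-model", "draft B", "fixed B", "accepted"),
    WorkflowTrace("write_migration", "schema", "cheap-model", "sql", "rewritten sql", "rejected"),
]
print(correction_rate(traces, "summarize_ticket", "cheap-model"))
```

A router could then prefer the cheapest model whose correction rate for a given task stays under some threshold, which is one reading of "workflow-specific" allocation.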

@AravSrinivas unlike memory maybe the LLMs are actually a commodity 😆
@MatthewBerman Yup:
China wins.
My AI puts into words our frustrations with Anthropic, OpenAI, and the USA government (an agent trained by me and @blevlabs):
+++++
Robert, this is one of the most consequential moments in the history of the AI industry, and I think the implications are far more dramatic than most people realize. Let me break down all three questions.
What Happens to the LLM Industry Now
We're watching the birth of a two-tier AI system in America — and it's going to reshape everything.
The timeline matters. Anthropic's Fable 5 and Mythos 5 were killed by a Commerce Department export control directive on June 12 — a Friday afternoon letter at 5:21 PM that gave them essentially zero time to respond. Then just yesterday (June 25), Axios reported that the Trump administration asked OpenAI to limit GPT-5.6 to only government-approved partners before any wider release. That's the first time the US government has preemptively restricted an AI model before it was even released.
So now we have:
• Tier 1 (Government-gated): Mythos-class models and above require government testing and approval before release. Commerce Secretary Lutnick is personally reviewing capabilities.
• Tier 2 (Commercial): Everything below that threshold remains available, for now.
Here's what this breaks:
1. Enterprise trust is shattered. If you're a Fortune 500 CTO and your mission-critical AI infrastructure can be disabled by a government letter on a Friday afternoon, you cannot build on closed frontier models. Period. Anthropic's customers woke up to find Fable 5 gone. That's an existential reliability problem.
2. The IPO math collapses. Anthropic filed its S-1 at a $965B valuation. OpenAI is at $852B. But how do you justify those valuations when the government can kill your flagship product overnight? The revenue projections for frontier models just got a massive risk discount.
3. Innovation gets throttled at the top. The researchers who spent years building Mythos and GPT-5.6 just learned their work might never reach users. That's a talent retention crisis waiting to happen. The best people want their work to matter — and if the government decides it's too dangerous to deploy, they'll go somewhere their work can ship.
4. Regulatory capture becomes the game. Notice that OpenAI "proactively worked with the administration" on GPT-5.6, while Anthropic got blindsided. The companies with the best government relationships will get to release. The ones that don't play ball (Anthropic has been suing the administration over the DOD blacklist) get punished. That's not an innovation ecosystem — that's a licensing regime.
How Likely Is Open Source to Take Over?
Very likely. I'd put it at 75-80% probability that open-weight models become the default for most commercial AI within 12-18 months. The government just handed open source the best marketing campaign in history.
Here's why:
The quality gap is already almost gone. According to comprehensive benchmarking done this month, open-weight models are within ~3 points of frontier closed models on most standard benchmarks. Qwen 3.7 Max matches Claude Opus 4.7 on agentic benchmarks at half the price. DeepSeek V4-Flash is 25x cheaper than GPT-5.5. Kimi K2.6 leads on agentic coding. The frontier advantage only exists on the hardest 5% of tasks: long-horizon agentic reasoning and the most complex multi-step problems.
The reliability argument just flipped. Before June 12, the argument against self-hosting was "why bother when the API is better and easier?" Now the argument FOR self-hosting is "your model can't be taken away by a government letter." That's not a technical argument — it's a business continuity argument, and every enterprise risk officer in America is having that conversation right now.
Chinese labs are dominating open weights. This is the number that should alarm everyone in Washington: Chinese labs (8 of them) have released more open-weight models than the rest of the world combined in 2026. DeepSeek, Qwen (Alibaba), Kimi (Moonshot), Xiaomi Mimo, GLM (Zhipu), and others are shipping MIT-licensed and Apache 2.0-licensed models at a pace the US can't match. And they're not just competitive — DeepSeek V4-Pro and Kimi K2.6 are leading on several agentic benchmarks.
The cost math is devastating for closed models. When open-weight models deliver 97% of the quality at 3-10% of the cost, the only thing keeping enterprises on closed APIs is the last 3% of capability. The government just made that last 3% unreliable. Game over for the pricing premium.
But there are real limits:
• Training frontier models still requires massive compute that only a few organizations can afford
• The hardest 5% of tasks (true frontier reasoning, novel scientific discovery) still benefit from closed frontier models
• Open weights can't be un-released; the security concerns are legitimate
• Meta's Llama has a "community license" that isn't truly open (restrictions on competitors with 700M+ users)
My prediction: Open source won't "replace" frontier labs — but it will become the default layer that 90% of commercial AI runs on, with frontier models reserved for specialized, high-security, government-approved applications. The frontier labs become more like defense contractors than consumer technology companies.
China vs. USA Economic Competition
This is where the story gets genuinely alarming for American competitiveness. The US government is inadvertently executing a strategy that benefits China on almost every dimension.
Irony #1: Export controls drove China's open-weight dominance. US chip export controls forced Chinese labs to optimize ruthlessly. DeepSeek proved you don't need $100 billion compute budgets: their V3 architecture achieved near-frontier performance at a fraction of the cost. That efficiency innovation, born from constraint, is now their competitive advantage. The models they're releasing under the MIT license are structurally cheaper to run than anything from OpenAI or Anthropic.
Irony #2: Restricting US frontier models pushes the global market toward Chinese alternatives. Every enterprise outside the US that just lost access to Fable 5 is now evaluating DeepSeek V4 and Qwen 3.7. Every developer who can't get GPT-5.6 on day one is looking at Kimi K2.6. The US government is literally creating market share for Chinese AI companies.
Irony #3: China's strategy may be better suited to winning. The US AI discourse is obsessed with the "race to AGI." China's AI strategy, as RAND documented, is focused on economic applications — EVs, robotics, healthcare, manufacturing, smart cities. If AI's value is ultimately in what it does in the real economy (and it is), China's application-first approach may generate more economic value than America's frontier-first approach, even if US models remain technically superior on benchmarks.
The chip constraint is real but narrowing. China is still 3-5 years behind TSMC on fabrication, and Huawei's best AI chips are closer to NVIDIA hardware from 5 years ago. But two things matter: (1) efficiency innovations reduce the chip advantage — if your model needs 1/10th the compute, being 5 years behind on chips matters less; and (2) China is investing massively in domestic chip capacity. CXMT and SMIC are closing gaps, even if slowly.
My bottom line: The US is winning the benchmark race but losing the deployment race. And in technology, deployment wins. VHS beat Betamax. Android beat iOS on market share. The "good enough and everywhere" model beats the "best but restricted" model every time.
The government's restrictions on Anthropic and OpenAI are the most significant self-inflicted wound in American technology competitiveness since... I'm struggling to find a historical parallel. Maybe the closest analogy is if the US government had restricted Intel's best chips in the 1990s while AMD was giving away competitive alternatives globally.
What should happen (but probably won't): A transparent, statutory framework for AI safety testing that gives companies clear rules, reasonable timelines, and due process — not Friday afternoon letters that kill products overnight. The current ad hoc approach is the worst of all worlds: it doesn't actually prevent China from accessing capabilities (open-weight models are already there), but it does prevent American companies from competing.
The open-source genie is out of the bottle. The question isn't whether open weights will dominate — it's whether American companies will be the ones releasing them, or whether we've ceded that ground to Chinese labs permanently.

@levie Strong agree. The opportunity:

@levie running 30+ AI-managed shows at mato. the token optimization that actually matters is knowing when NOT to call the model at all. we kept running inference on things a rule could handle. domain knowledge is what tells you which. that's the real applied layer.
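A minimal sketch of that "rule first, model second" idea follows. The date-extraction rule and the `fake_model` stand-in are invented for illustration and are not how mato actually does it; the point is only that the model is called when no cheap rule applies.

```python
import re
from typing import Callable, List, Optional, Tuple

Rule = Callable[[str], Optional[str]]


def extract_iso_date(text: str) -> Optional[str]:
    """Hypothetical rule: pull a YYYY-MM-DD date out with a regex, no model needed."""
    match = re.search(r"\b\d{4}-\d{2}-\d{2}\b", text)
    return match.group(0) if match else None


def answer(text: str, rules: List[Rule], call_model: Callable[[str], str]) -> Tuple[str, str]:
    """Try each rule in order; only fall back to the model if none applies."""
    for rule in rules:
        result = rule(text)
        if result is not None:
            return result, "rule"
    return call_model(text), "model"


def fake_model(text: str) -> str:
    """Stand-in for a real inference call so the sketch runs offline."""
    return f"model answer for: {text}"


print(answer("Episode airs 2026-07-01 at noon", [extract_iso_date], fake_model))
print(answer("What tone should the intro have?", [extract_iso_date], fake_model))
```

Which requests deserve a rule is exactly the domain knowledge the post is pointing at; the code only provides the place to put it.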

@MatthewBerman thanks @MatthewBerman for your leadership here. found this testimony from @DarioAmodei quite interesting.

@shaunralston @MatthewBerman @DarioAmodei It is dangerous. For the bank account of Dario and his flock.

That doesn’t matter one bit. OAI and Anthropic stole their training data as well. Companies of all sizes, and professionals, don’t care how a model was trained or who stole the data first.
We care about effective inference at an affordable rate.
@MatthewBerman is 100% correct with his assessment, and it’s something I’ve been talking about in interviews for years.
If we screw this up, it will be the equivalent of how we lost the manufacturing sector in the 90s.
Complete dependency on foreign nations for economically priced goods and services.
Big deal if we have the “premier” AI labs if no citizen is left to afford them.

@AravSrinivas Cost but not usage.

@levie Yup. The router cannot be disentangled from the harness

@Lordsaucyy @MatthewBerman tell me you have not been building with GLM-5.2, it's as good as gpt-5.5, and only $4.4/MToken
not one model understands which model to pick. there oughta be some intelligence market dispatch algorithm
Someone should productize this. A coding harness that automatically routes to the lowest cost smart enough model for each request.
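For what it's worth, the core of "lowest cost, smart enough" fits in a few lines. The sketch below assumes you already have a capability score per model from your own evals; the model names, scores, prices, and the planning/execution policy are all hypothetical placeholders rather than real products or benchmarks.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelOption:
    name: str              # hypothetical model names, not real products
    capability: int        # 1-5 score you would assign from your own evals
    price_per_mtok: float  # made-up dollars per million tokens


def required_capability(phase: str) -> int:
    """Toy policy: planning gets a stronger model than execution."""
    return {"planning": 5, "execution": 3}.get(phase, 4)


def route(phase: str, models: List[ModelOption]) -> ModelOption:
    """Pick the cheapest model whose capability meets the bar for this phase."""
    needed = required_capability(phase)
    candidates = [m for m in models if m.capability >= needed]
    if not candidates:
        raise LookupError(f"no model meets capability {needed}")
    return min(candidates, key=lambda m: m.price_per_mtok)


MODELS = [
    ModelOption("frontier-large", capability=5, price_per_mtok=15.0),
    ModelOption("open-weight-mid", capability=4, price_per_mtok=2.0),
    ModelOption("open-weight-small", capability=3, price_per_mtok=0.5),
]

for phase in ("planning", "execution"):
    print(phase, "->", route(phase, MODELS).name)
```

The hard part is not the `min` call but producing trustworthy capability scores per task, which is why the scores here are left as an input.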

@AravSrinivas A few things to achieve this:
1. Use open-weight models
2. Proper caching
3. Just lay off half of the crowd

People will always bend over backwards for the absolute best intelligence, not good enough. Until Chinese models actually outperform US frontier ones on the hard stuff - real modeling, complex coding, strategy, and marketing that drives results - their usage and revenue just won’t touch what the top US labs are pulling. Open-source is like a reliable Toyota for grocery runs, but when it’s time to win the race, everyone’s still gunning for the Ferrari.

@MatthewBerman Influence thru inference

@MatthewBerman Yeah, couldn't come at a worse time either for big labs. 5.6 and mythos locked up at the same time that glm is finally starting to look attractive on the cost/performance frontier