even polymarket sees it now
JUST IN: Businesses are shifting from AI “tokenmaxxing” to efficiency, threatening the explosive growth of OpenAI & Anthropic.
Enterprises once raced to burn through as many AI tokens as possible, especially in coding workflows, but are now tightening budgets and demanding measurable returns, a change that could curb the rapid revenue ramps previously forecast for OpenAI and Anthropic.
Firms such as Uber have introduced monthly AI tiers after burning through annual budgets in just four months, while Lindy switched entirely to cheaper open-weight alternatives expecting millions in savings.
OpenAI and Anthropic both filed confidential IPO documents in early June, just as analysts noted that current growth rates are likely the fastest either will ever see.
Many users welcome the shift by businesses like Coinbase from AI tokenmaxxing to efficiency through smarter defaults, routing, and caching, because it cuts spend and waste while keeping the focus on real value.
How to keep AI spend flat while token usage grows exponentially: Not with friction and spend alerts. With better defaults, routing, and caching.
Better Defaults (not Usage Caps) – Engineers can choose any model they want, but defaults matter. We’re experimenting with defaulting to open weight models like GLM 5.2 and Kimi 2.7 through our LLM gateway, while still encouraging engineers to choose the right model for the task. 91% of our employees were never hitting their usage caps, so instead of lowering caps and driving up alerts, we're moving to cheaper defaults. Note that code reviews use a diversity of models, so they can check each other's work.
Better Routing – In our custom harnesses, we preprocess prompts and route to the best model for the job, considering cache hits and model pricing. For instance, you may want a frontier model for planning, but not for execution, where it can be overkill. Ultimately, humans shouldn't be choosing models - AI can automate this task.
Better Caching – Cache misses are the easiest way to drive your cost up. All of our requests are cache aware, so we’re reusing a warm cache wherever possible. For example, our cache hit rate went from 5% → 60% in LibreChat once properly implemented.
Keep Context Lean – Start fresh sessions when switching tasks. Scope file context narrowly. Disconnect unused tools. Don't just compact. The goal isn't fewer tokens used, it's fewer tokens wasted.
Better Visibility – Our engineers can use as many tokens as they want, from whatever model they want, but we’ve made usage visible – and the more you spend on AI, the more impact we expect.
The goal isn't to suppress usage. It's to build the infrastructure that makes exponential growth sustainable.
Putting this into practice has cut our AI spend nearly in half, while our token usage continues to grow.
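The routing and caching points above can be sketched as a small cost model. This is only an illustration, not the implementation described in the post: the model names, per-million-token prices, the cached-input discount, and the rule that planning needs a frontier model are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str                  # hypothetical model identifier
    input_price: float         # USD per million uncached input tokens (made up)
    cached_input_price: float  # USD per million cached input tokens (made up)
    output_price: float        # USD per million output tokens (made up)
    tier: str                  # "frontier" or "open-weight"


# Hypothetical catalog; real prices vary by provider and change over time.
MODELS = [
    Model("frontier-large", 10.0, 1.0, 30.0, "frontier"),
    Model("open-weight-default", 1.0, 0.1, 3.0, "open-weight"),
]


def request_cost(model, input_tokens, output_tokens, cache_hit_rate):
    """Estimate the USD cost of one request.

    cache_hit_rate is the fraction of input tokens served from a warm
    prompt cache (0.0 to 1.0) and billed at the cheaper cached price.
    """
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    total = (
        uncached * model.input_price
        + cached * model.cached_input_price
        + output_tokens * model.output_price
    )
    return total / 1_000_000


def route(task_kind, input_tokens, output_tokens, warm_cache):
    """Pick the cheapest model that is allowed for the task.

    warm_cache maps a model name to its expected cache hit rate.
    Assumed policy: "planning" requires a frontier model, while any
    other task kind may use any model in the catalog.
    """
    candidates = [
        m for m in MODELS
        if task_kind != "planning" or m.tier == "frontier"
    ]
    return min(
        candidates,
        key=lambda m: request_cost(
            m, input_tokens, output_tokens, warm_cache.get(m.name, 0.0)
        ),
    )
```

Calling request_cost with the same token counts at a hit rate of 0.05 and then 0.60 shows how much of the bill depends on cache hits rather than on how many tokens were used, and route("execution", ...) falls through to the cheaper model unless a policy says otherwise.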
> Better Caching – Cache misses are the easiest way to drive your cost up. All of our requests are cache aware, so we’re reusing a warm cache wherever possible.
We do this for you in Deep Agents - see our blog on it here:
@brian_armstrong next step: post-training your own models based on open-source!
OpenAI and Anthropic face new AI reality as companies shift from tokenmaxxing to efficiency https://www.cnbc.com/2026/06/26/openai-anthropic-new-ai-spending-reality-as-users-shift-to-efficiency.html?taid=6a3e6cfd4493680001f19646&utm_campaign=trueanthem&utm_content=main&utm_medium=social&utm_source=twitter

@CNBC ALFRED's words are absolutely right! We were warned about AI back in 1992, ---"Sounds as if the human race could become quite expendable for AI." Stop AI

@sam90860759 see my essay yesterday on the fizzle. i think maybe it is a fizzle rather than a pop.

@brian_armstrong brian have u checked out @AskSurplus? it's an inference marketplace on Base built by @mac_eth who you may know
might cut ur costs down!

@GaryMarcus Gary I agree with the ai bubble premise, but we are a little early. I think it pops closer to end of 2027.

@btsouth @brian_armstrong Claude Desktop uses the same baseline harness as Claude Code, via the Claude Agent SDK, which just runs the Claude Code binary. All of these well-known harnesses have tool search built in now; it's table stakes to even be a functional product after the last several quarters.

@brian_armstrong this is exactly what @_adamr_1 built @TracerML for

"Disconnect unused tools" is the line most people skip, and it's the biggest lever in here. Every idle MCP server dumps its full tool list into context on every request, whether the agent calls it or not. You pay that tax before typing a word.
It's the whole reason we built @conduitmcp: one gateway that hands the agent 3 meta-tools to search on demand instead of every server's full catalog. Measured ~90% fewer tokens, same results.
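A rough sketch of the meta-tool idea described above, for readers who want to see the shape of it. This is not Conduit's API or any MCP SDK: the ToolGateway class, its method names, the keyword-overlap search, and the META_TOOLS definitions are all hypothetical, and the sketch makes no claim about actual token savings.

```python
import json


class ToolGateway:
    """Hypothetical gateway that hides a large tool catalog behind three meta-tools."""

    def __init__(self):
        self._tools = {}

    def register(self, name, description, schema, handler):
        # handler is any callable that accepts the tool's arguments as keywords
        self._tools[name] = {
            "description": description,
            "schema": schema,
            "handler": handler,
        }

    def full_catalog(self):
        """What would be sent on every request without a gateway."""
        return [
            {"name": n, "description": t["description"], "schema": t["schema"]}
            for n, t in self._tools.items()
        ]

    def search_tools(self, query, limit=5):
        """Meta-tool 1: return names of tools whose name or description matches.

        Naive keyword overlap, used only for illustration.
        """
        words = query.lower().split()
        scored = []
        for name, tool in self._tools.items():
            text = f"{name} {tool['description']}".lower()
            score = sum(1 for w in words if w in text)
            if score:
                scored.append((score, name))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [name for _, name in scored[:limit]]

    def describe_tool(self, name):
        """Meta-tool 2: load one tool's full schema only when it is needed."""
        tool = self._tools[name]
        return {"name": name, "description": tool["description"], "schema": tool["schema"]}

    def call_tool(self, name, arguments):
        """Meta-tool 3: invoke a tool by name with a dict of arguments."""
        return self._tools[name]["handler"](**arguments)


# The only definitions placed in the agent's context up front (hypothetical wording).
META_TOOLS = [
    {"name": "search_tools", "description": "Find tools by keyword."},
    {"name": "describe_tool", "description": "Get the full schema for one tool."},
    {"name": "call_tool", "description": "Call a tool by name with arguments."},
]


def context_size(definitions):
    """Crude proxy for context cost: characters of serialized JSON, not tokens."""
    return len(json.dumps(definitions))
```

Comparing context_size(META_TOOLS) with context_size(gateway.full_catalog()) on a populated gateway gives a feel for the per-request overhead being avoided; a real measurement would use the model's own tokenizer.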

@brian_armstrong 💪💪💙💙

@brian_armstrong Cool beans 🫘
@brian_armstrong @ClementDelangue I’m running both of these on mi300x amd and am super happy… imho better than current codex

@brian_armstrong spittin bars

@brian_armstrong @arnavbathla20 For now but it will inevitably 📈

@brian_armstrong Brian, just let it Ride...

@CNBC Dario, you, of all people, should have been able to resist this logic of power. What’s happening is exactly the worst nightmare. An oligarch-emperor who decides who will control the intelligence of the future—the ultimate power. OSP! (open source power ✊)

@brian_armstrong Solid advice, mate. Cutting costs while scaling usage is the real move. Crypto bros been saying this for years.