Epoch AI's Jaime Sevilla and Luke Emberson warn that surging token demand will outpace global Blackwell capacity through 2032

The shortage could force developers to deploy smaller models.

Original post

Chris Painter#1472

Epoch AI@EpochAIResearch

Are we nearing a compute crunch?

In our latest Gradient Update, @luke__emberson and @Jsevillamol estimate how many tokens all the Blackwell chips on Earth could serve, and compare this to total token demand. Direct comparisons are difficult, but it appears demand is growing much faster than supply.

1:35 PM · May 26, 2026 · 26.5K Views

Posts from X

Jaime Sevilla@Jsevillamol

Another important conclusion on the supply side is that inference is not really compute or bandwidth bound. If you have spare resources, engineers will find ways to use them, using tools like speculative decoding and prefill chunking.

Jaime Sevilla@Jsevillamol

Unfortunately for token demand we have very limited information, so all we can offer are some proxies for growth. I hope we can come back to the topic in a few months with more information and a clearer conceptual framework.

Jaime Sevilla@Jsevillamol

Deep dive into token supply and demand! I come away with the impression that there is going to be significant pressure to keep models intended for the general public small in size.

Epoch AI@EpochAIResearch

Are we nearing a compute crunch?

Epoch AI@EpochAIResearch

Our supply estimate is based on serving Kimi K2.6, a trillion-parameter model with 32B active parameters. Using 8k:1k input-to-output token requests, we estimate it would be possible to serve ~20B output tok/s, enough to serve every person on earth 7M tokens per month.
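The per-person figure can be sanity-checked with a few lines of arithmetic. This is a rough sketch, not Epoch AI's calculation: the 30-day month and the 8 billion world population are assumptions made here, and only the ~20B output tok/s figure comes from the thread.

```python
# Rough check of "~20B output tok/s is about 7M tokens per person per month".
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month


def tokens_per_person_per_month(output_tok_per_s: float, population: float) -> float:
    """Spread a sustained output-token rate evenly over a population."""
    return output_tok_per_s * SECONDS_PER_MONTH / population


supply_tok_per_s = 20e9   # from the thread
world_population = 8e9    # assumption: roughly 8 billion people

per_person = tokens_per_person_per_month(supply_tok_per_s, world_population)
print(f"{per_person / 1e6:.1f}M output tokens per person per month")
```

Under these assumptions the result lands between 6M and 7M, in line with the rounded figure quoted above.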

Epoch AI@EpochAIResearch

That estimate is highly dependent on the choice of model and inference settings. At longer context lengths, throughput falls substantially. For instance, 128k:1k requests could only be served at around 500M tok/s, a factor of ~50 less.
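To make the long-context penalty concrete, the sketch below compares the two rounded throughput figures quoted in the thread. The 30-day month and 8 billion population are assumptions introduced here. The rounded inputs give a ratio of 40, the same order of magnitude as the "~50" stated; the gap presumably reflects rounding in the quoted figures.

```python
# Compare the thread's rounded supply figures for short and long contexts.
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month
WORLD_POPULATION = 8e9              # assumption: roughly 8 billion people

short_context_tok_per_s = 20e9   # 8k:1k requests, from the thread
long_context_tok_per_s = 500e6   # 128k:1k requests, from the thread

slowdown = short_context_tok_per_s / long_context_tok_per_s
long_context_per_person = long_context_tok_per_s * SECONDS_PER_MONTH / WORLD_POPULATION

print(f"throughput ratio from rounded figures: {slowdown:.0f}x")
print(f"long-context tokens per person per month: {long_context_per_person:,.0f}")
```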

Epoch AI@EpochAIResearch

Our analysis up to here is based only on a rough approximation of a complex process. To calibrate our estimates, we use data from SemiAnalysis InferenceX, which provides detailed results from real-world inference experiments (including on Kimi K2.5/K2.6).

Epoch AI@EpochAIResearch

How do our final figures compare to demand? Google just announced that it processes about 1.2B tokens per second (input + output). If we model this volume as 8k:1k requests, that works out to 130M output tokens per second.

Exponential View estimates that Google makes up about 25% of the global demand for tokens. Even if we lavishly insisted on serving all these tokens using expensive trillion-parameter MoE models, the estimated Blackwell supply would be enough to serve all current demand.
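The demand-side arithmetic can be reproduced as follows. This is a minimal sketch using only figures quoted in the thread (1.2B total tok/s, an 8:1 input-to-output split, a 25% Google share, and ~20B output tok/s of supply); the helper name `output_rate` is hypothetical.

```python
def output_rate(total_tok_per_s: float, input_parts: float, output_parts: float) -> float:
    """Output-token share of a combined input+output token rate."""
    return total_tok_per_s * output_parts / (input_parts + output_parts)


google_total_tok_per_s = 1.2e9                                  # from the thread
google_output_tok_per_s = output_rate(google_total_tok_per_s, 8, 1)  # 8k:1k requests

google_share = 0.25                                             # Exponential View estimate
global_output_demand = google_output_tok_per_s / google_share

supply_tok_per_s = 20e9                                         # 8k:1k supply estimate
headroom = supply_tok_per_s / global_output_demand

print(f"Google output rate: {google_output_tok_per_s / 1e6:.0f}M tok/s")
print(f"Implied global output demand: {global_output_demand / 1e6:.0f}M tok/s")
print(f"Supply / demand headroom: {headroom:.1f}x")
```

Google's output rate comes out near 133M tok/s, consistent with the rounded 130M above, and the implied global demand is a little over 500M output tok/s.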

Epoch AI@EpochAIResearch

We build towards this estimate with a simplified model of inference. First, we study prefill and decode, calculating the time for each as a function of batch size. For 8k:1k requests, we find that the prefill phase is compute-bound, while decoding is bandwidth-bound.
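A toy roofline calculation illustrates why prefill and decode can land on different bottlenecks. This is not Epoch AI's model. The hardware numbers are placeholders (not Blackwell specifications), the 2 FLOPs per active parameter per token rule is a common approximation, and the sketch assumes 1 byte per parameter with weights read once per forward pass, ignoring attention FLOPs, KV-cache traffic, and the extra experts a larger batch touches in an MoE.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    peak_flops: float       # FLOP/s (placeholder value, not a real chip spec)
    peak_bandwidth: float   # bytes/s (placeholder value, not a real chip spec)


def pass_time(tokens: int, flops_per_token: float, bytes_read: float, hw: Hardware):
    """Roofline estimate: a forward pass takes as long as its slower resource."""
    compute_s = tokens * flops_per_token / hw.peak_flops
    memory_s = bytes_read / hw.peak_bandwidth
    bound = "compute" if compute_s >= memory_s else "bandwidth"
    return max(compute_s, memory_s), bound


hw = Hardware(peak_flops=1e15, peak_bandwidth=4e12)  # hypothetical accelerator

ACTIVE_PARAMS = 32e9                  # active parameters, from the thread
FLOPS_PER_TOKEN = 2 * ACTIVE_PARAMS   # approximation: 2 FLOPs per active param
WEIGHT_BYTES = ACTIVE_PARAMS * 1.0    # assumption: 1 byte per parameter

# Prefill processes all ~8k prompt tokens in one pass.
prefill_time, prefill_bound = pass_time(8_000, FLOPS_PER_TOKEN, WEIGHT_BYTES, hw)
# Decode processes one new token per request; assume a batch of 16 requests.
decode_time, decode_bound = pass_time(16, FLOPS_PER_TOKEN, WEIGHT_BYTES, hw)

print(f"prefill: {prefill_time:.3f}s, {prefill_bound}-bound")
print(f"decode:  {decode_time:.3f}s, {decode_bound}-bound")
```

Raising the decode batch size in this toy model increases compute time while memory time stays flat, which is why the time per phase is naturally expressed as a function of batch size.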

Epoch AI@EpochAIResearch

But using chunked prefill, we can interleave the prefill and decode phase of different user batches, so that we are only bottlenecked by either compute or memory. We can also use speculative decoding to get multiple output tokens per forward pass with minimal overhead cost.
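For speculative decoding, a standard back-of-the-envelope formula gives the expected number of tokens produced per target-model forward pass. The sketch below assumes each drafted token is accepted independently with the same probability, which is a simplification; the acceptance rate and draft length are illustrative inputs, not values from the Gradient Update.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass with `draft_len` drafted tokens.

    Assumes each drafted token is accepted independently with probability
    `acceptance_rate`; the pass always yields at least one token.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


for rate in (0.5, 0.8, 0.9):
    gain = expected_tokens_per_pass(rate, draft_len=4)
    print(f"acceptance {rate:.1f}, 4 drafted tokens: {gain:.2f} tokens per pass")
```

The gain is only "free" when the verification pass uses otherwise idle compute, which matches the thread's point that decode is bandwidth-bound.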

Epoch AI@EpochAIResearch

Available data suggest there are about 14M SWEs using AI daily worldwide. If they matched the intensity of Meta or Apple employees, global throughput demand could reach 200M to 4B tok/s — likely beyond what today's Blackwell fleet could serve at long context lengths.
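The 200M to 4B tok/s range follows from scaling per-user intensity to 14M daily users. The sketch below uses the roughly 1M and 25M output tokens per day per employee that the thread attributes to Meta and Apple; spreading demand evenly over 24 hours is an assumption made here, so peak-hour demand would be higher.

```python
SECONDS_PER_DAY = 24 * 3600


def fleet_demand_tok_per_s(users: float, tokens_per_user_per_day: float) -> float:
    """Average output tok/s if every user consumes the given daily volume."""
    return users * tokens_per_user_per_day / SECONDS_PER_DAY


daily_ai_swes = 14e6  # from the thread

low = fleet_demand_tok_per_s(daily_ai_swes, 1e6)    # Meta-like intensity
high = fleet_demand_tok_per_s(daily_ai_swes, 25e6)  # Apple-like intensity

print(f"Meta-like intensity:  {low / 1e6:.0f}M output tok/s")
print(f"Apple-like intensity: {high / 1e9:.1f}B output tok/s")
```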

Epoch AI@EpochAIResearch

The Information reported that Meta's 85k employees consumed 60T tokens/month, or ~1M output tokens/day per employee at 25k:1k requests.

Other reports suggest Apple allows some engineering teams to spend $300/day on tokens, enough for 25M Kimi K2.6 output tokens/day per employee.
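Both per-employee figures can be rederived from the reported totals. This sketch assumes a 30-day month and that the 60T figure counts input plus output tokens. The last line is only the cost per million output tokens implied by the thread's own numbers, not a published price.

```python
# Meta: reported 60T tokens/month across 85k employees, modeled as 25k:1k requests.
meta_tokens_per_month = 60e12
meta_employees = 85_000
DAYS_PER_MONTH = 30          # assumption

total_per_employee_per_day = meta_tokens_per_month / meta_employees / DAYS_PER_MONTH
output_fraction = 1 / (25 + 1)   # 25 input tokens for every output token
meta_output_per_employee_per_day = total_per_employee_per_day * output_fraction

# Apple: $300/day reportedly buys about 25M Kimi K2.6 output tokens.
apple_budget_usd_per_day = 300
apple_output_tokens_per_day = 25e6
implied_usd_per_million_output = apple_budget_usd_per_day / (apple_output_tokens_per_day / 1e6)

print(f"Meta: {meta_output_per_employee_per_day / 1e6:.2f}M output tokens/day per employee")
print(f"Apple: implied ${implied_usd_per_million_output:.0f} per 1M output tokens")
```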

Epoch AI@EpochAIResearch

But growth in supply looks slow relative to trends in demand. Google has experienced a 10×/year increase in tokens processed since 2024. Exponential View finds a similar rate of growth across all providers.

Epoch AI@EpochAIResearch

Our simplified analysis misses many factors. Improving capabilities drive demand up, but smaller models also begin to match older, larger ones, displacing some demand. We've also focused on Kimi K2.6; closed frontier models may differ substantially.

Epoch AI@EpochAIResearch

The supply of tokens is also steadily increasing. The expansion of AI infrastructure and increases in chip efficiency have expanded global compute by 3.4×/year and memory bandwidth in AI chips by 4.1×/year. Our analysis finds compute growth to be the long-run bottleneck.
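To see how quickly a growth gap erodes spare capacity, the sketch below extrapolates both trends as constant exponentials. That is a strong assumption the thread does not make, and the function name is hypothetical. The inputs are the thread's rounded figures: 10×/year demand growth, 3.4×/year compute growth, ~20B (8k:1k) and ~500M (128k:1k) output tok/s of supply, and global demand implied by Google's 1.2B tok/s at a 25% share. Comparing long-context supply with demand modeled at 8k:1k is deliberately crude.

```python
import math


def years_until_demand_meets_supply(headroom: float,
                                    demand_growth: float,
                                    supply_growth: float) -> float:
    """Years until demand catches supply, assuming constant yearly growth factors."""
    if headroom <= 1.0:
        return 0.0            # demand already meets or exceeds supply
    if demand_growth <= supply_growth:
        return math.inf       # demand never catches up
    return math.log(headroom) / math.log(demand_growth / supply_growth)


# Global output demand implied by the thread: Google's output share of 1.2B tok/s
# (1 in 9 tokens at 8k:1k), divided by Google's ~25% share of global demand.
global_output_demand = (1.2e9 / 9) / 0.25

short_headroom = 20e9 / global_output_demand    # 8k:1k supply estimate
long_headroom = 500e6 / global_output_demand    # 128k:1k supply estimate

years_short = years_until_demand_meets_supply(short_headroom, 10.0, 3.4)
years_long = years_until_demand_meets_supply(long_headroom, 10.0, 3.4)

print(f"short-context headroom {short_headroom:.1f}x -> {years_short:.1f} years")
print(f"long-context headroom {long_headroom:.2f}x -> {years_long:.1f} years")
```

Under these assumptions the short-context headroom lasts a few years, while the long-context case is already at or past parity.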

Herbie Bradley@herbiebradley

@Jsevillamol perhaps an underrated fact that "TAI by 2030" does not just depend on the speed of R&D, but also whether TAI fits in X billion params

probably X is like 10 at most assuming an moe with typical sparsity

Jaime Sevilla@Jsevillamol

Deep dive into token supply and demand! I come away with the impression that there is going to be significant pressure to keep models intended for the general public small in size.

Epoch AI@EpochAIResearch

Regardless, our analysis suggests a compute crunch is near, if not already here.

Read the full Gradient Update: https://epoch.ai/gradient-updates/is-a-compute-crunch-coming

d123@dfh13571

@Jsevillamol What do you think about the implication of this? Will this lead to power and wealth concentration in the short term? And what about the current AI discourse? It seems like a lot of people still think AI is a stochastic parrot that can't make anything new.

Strata@ChainZenit

@EpochAIResearch @luke__emberson @Jsevillamol wait this compute vs demand thing feels important