FACT ALERT 🚨: In modern agentic coding, 42% of the time is spent on the CPU doing tool use such as editing files, running Bash scripts, running lints, etc. Traditional cloud computing charges $ per CPU core. In the agent economy, the business model is $ per token, so to increase token revenue you need to increase the amount of CPU power you have so that you can keep generating tokens.
SemiAnalysis posted data from 174,264 agentic coding sessions showing 42% of runtime on CPU tasks versus 58% on GPU inference, and highlighted the mismatch between per-core cloud pricing and per-token monetization.
Median per-turn time measured 5.13 seconds.
Some users are enthusiastic about optimization opportunities for agentic coding tools due to bottlenecks in existing apps, while others criticize the analysis as misguided or promotional and highlight local versus cloud distinctions.
Here is one big reason why this matters. Time spent on non-LLM-inference tasks is only going to increase. However, the tools that these AI systems use are *very* inefficient and have been built from the ground up for CPU and human use. There is a huge untapped opportunity to significantly improve those processes by designing them with AI agents in mind from the ground up.
Very important.

@SemiAnalysis_ The wild part is we spend months shaving microseconds off attention kernels, just for the agent to sit idle for 2 seconds waiting for a bash script to return stdout. Amdahl's law is undefeated.
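
A quick sketch of the Amdahl's law point in the reply above. It assumes the 58% / 42% split quoted in the thread applies to end-to-end runtime, and that tool time stays fixed and does not overlap with inference; `amdahl_speedup` is an illustrative helper, not something from the SemiAnalysis post.

```python
def amdahl_speedup(accelerated_share: float, factor: float) -> float:
    """End-to-end speedup when `accelerated_share` of runtime becomes `factor` times faster."""
    if not 0.0 <= accelerated_share <= 1.0:
        raise ValueError("accelerated_share must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - accelerated_share) + accelerated_share / factor)


INFERENCE_SHARE = 0.58  # GPU inference share of runtime quoted in the thread

for factor in (2, 10, 100):
    overall = amdahl_speedup(INFERENCE_SHARE, factor)
    print(f"inference {factor}x faster -> {overall:.2f}x overall")

# Ceiling if inference took zero time: the 42% tool share still remains.
print(f"upper bound: {1.0 / (1.0 - INFERENCE_SHARE):.2f}x")
```

Under these assumptions, doubling inference speed gives only about a 1.4x end-to-end gain, and no amount of kernel tuning gets past roughly 2.4x while the tool share is left untouched.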

@SemiAnalysis_ ran a similar breakdown last month.
Sen et al swapped the harness on 116 agent tasks. same retriever, same model, 4x spread in tool calls. your 42% is a floor, not a ceiling.

This tracks with what I see daily. I run agentic coding on a MacBook M1 and it handles it surprisingly well, because the real bottleneck isn't GPU horsepower, it's the CPU doing file ops, bash, linting, all the tooling between turns.
With 42% of the time on the CPU, the best dev machine for agents isn't the one with the fattest GPU. It's the one you can work on for 10 hours straight from anywhere.
Portability > raw compute for this workflow.

@SemiAnalysis_ The latency of an OS context switch is tiny compared to LLM latency; you can probably run 100 agents on 2 vCPUs at current LLM latencies without issue.
A low-level agent dispatch framework could handle process switches quite gracefully on a large cluster.
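
A back-of-envelope check on the "100 agents on 2 vCPUs" estimate, as a rough sketch only. It assumes each agent spends 42% of its wall time in tools (the thread's figure) and that only some fraction of that tool time actually keeps a core busy; `cpu_busy_fraction` and both helper names are hypothetical, and the model covers average demand only, ignoring bursts and queuing.

```python
def cores_needed(agents: int, tool_share: float, cpu_busy_fraction: float) -> float:
    """Average number of busy cores for `agents` concurrent agents."""
    return agents * tool_share * cpu_busy_fraction


def max_busy_fraction(agents: int, cores: float, tool_share: float) -> float:
    """Largest CPU-busy fraction of tool time that still fits on `cores`."""
    return cores / (agents * tool_share)


TOOL_SHARE = 0.42  # tool/CPU share of runtime quoted in the thread

# If tool time were pure computation, 100 agents would need about 42 cores.
print(f"{cores_needed(100, TOOL_SHARE, 1.0):.1f} cores at 100% busy")

# For 100 agents to fit on 2 vCPUs, tool time must be mostly waiting.
print(f"{max_busy_fraction(100, 2, TOOL_SHARE):.1%} max CPU-busy fraction")
```

On these numbers the estimate holds only if tool calls keep a core busy for under about 5% of their duration, which is consistent with the replies arguing that most of the "CPU time" is really I/O wait.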

@GoodmanAric

@tunguz It baffles the mind that we are using JSON files and .md files instead of designing custom primitives for these tasks, such as 256k GPU-direct blocks that can be streamed directly from NVMe to the GPU without CPU processing.

This is the part most people are missing: agentic coding isn’t just faster, it changes the shape of the work itself. When the cost center moves from compute to actions, the entire stack reorganizes. You stop optimizing for raw horsepower and start optimizing for loop efficiency, context design, and tool orchestration.
The real leverage won’t come from bigger models. It’ll come from tighter systems.

@tunguz Extremely. It also signals that new hardware is necessary.

@SemiAnalysis_ Bash is extremely slow. Before, it didn't matter, since the latency was paid for by a human's attention or a coffee break. Now that you have agents, the economics fundamentally change: a single-percent improvement in grep can translate into millions of dollars fleet-wide.

@SemiAnalysis_ Stupid take to pump CPU stocks.
Why look at time and not utilization? Most of the CPU time is blocked on I/O, so buying more CPUs won't even help.
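
The time-versus-utilization distinction can be illustrated with a small sketch. It compares wall-clock time with CPU time for two stand-in "tools": one that only waits (a sleep standing in for disk or network I/O) and one that spins. `measure`, `blocked_tool`, and `busy_tool` are hypothetical names, not part of any agent harness.

```python
import time


def measure(fn):
    """Return (wall_seconds, cpu_seconds) for one call of fn."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of this process only, not children
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


def blocked_tool():
    # Stand-in for a tool call that mostly waits on I/O.
    time.sleep(0.2)


def busy_tool():
    # Stand-in for a tool call that actually computes for about 0.2 s.
    deadline = time.perf_counter() + 0.2
    while time.perf_counter() < deadline:
        pass


for name, fn in (("blocked", blocked_tool), ("busy", busy_tool)):
    wall, cpu = measure(fn)
    print(f"{name}: wall={wall:.2f}s cpu={cpu:.2f}s utilization={cpu / wall:.0%}")
```

Both calls take about the same wall time, but only the busy one consumes a core for that duration. A runtime share like 42% measures the first quantity, so by itself it says little about how many cores are needed.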

@SemiAnalysis_ Besides, nobody is going to give agents CPU time for free. So it is $/token + $/core.

@tunguz @grok how differently would you have to do things to be very efficient?

@SemiAnalysis_ We’re too focused on inference. What about tool latency, I/O, and scheduling? That’s a big part of the stack now.

@SemiAnalysis_ this is the part people miss when they only optimize the model inference layer. the agent spends half its life in a shell, and a cold or slow sandbox taxes every one of those tool calls. the environment the agent runs in is as much a perf lever as kv cache

@SemiAnalysis_ @sailresearchco is building persistent CPU sandboxes for long-horizon tasks, worth checking out
https://www.sailresearch.com/news/introducing-sailboxes-persistent-sandboxes

@SemiAnalysis_ this is your funniest post to date, probably ever, i know you didn’t mean it but this is hilarious, good job

@SemiAnalysis_ Calling it “CPU time” is misleading. That 2.15s is mostly I/O (file reads/writes), not CPU computation. Real bottleneck is tool execution + I/O latency.
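
The 2.15 s figure in the reply above is not stated in the summary; it appears to come from applying the 42% share to the 5.13 s median per-turn time. The arithmetic is sketched below, with the caveat that applying an aggregate runtime share to a median turn is itself an approximation.

```python
TOOL_SHARE = 0.42            # CPU/tool share of runtime quoted in the thread
INFERENCE_SHARE = 0.58       # GPU inference share quoted in the thread
MEDIAN_TURN_SECONDS = 5.13   # median per-turn time quoted in the thread

tool_seconds = TOOL_SHARE * MEDIAN_TURN_SECONDS
inference_seconds = INFERENCE_SHARE * MEDIAN_TURN_SECONDS

print(f"tool time per turn:      {tool_seconds:.2f} s")
print(f"inference time per turn: {inference_seconds:.2f} s")
```

This gives about 2.15 s of tool time and 2.98 s of inference per median turn.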