
Will Brown says serving large Mixture of Experts models economically requires high batch sizes on 32 to 64 GPUs

Story Overview

In a technical exchange about wide expert parallelism for trillion-parameter models, Will Brown noted that, as with all large Mixture of Experts models, economical serving only kicks in at fairly high batch sizes, and that deployments of 32 or 64 GPUs are common. The exchange ties into ongoing conversations on reinforcement-learning infrastructure.


Original post

will brown@willccbb

@Dorialexander pd disagg + wide-ep + kv offloading + cache-aware routing goes a long way

https://www.primeintellect.ai/blog/rl-at-1t-scale

Alexander Doria@Dorialexander

Yeah vllm/sglang do work quite well now, batching ensures a good theoretical throughput but then everyone wants to Claude Code and you have to manage X parallel sessions, each with varying latency.

2:13 AM · Jun 27, 2026 · 313 Views

GPU Economics

Cluster scale drives cost curves

Brown’s observation is consistent with current wide expert parallelism (Wide-EP) practice, where the fixed costs of loading expert weights and exchanging tokens between GPUs are amortized only once enough tokens are processed together. The exchange does not state exact batch-size thresholds for specific models.
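To make the batch-size argument concrete, here is a rough back-of-envelope sketch of how many tokens each expert sees per step. The expert count, top-k value, and target below are hypothetical illustration values, not figures from the thread, and the sketch assumes perfectly uniform routing and ignores communication cost entirely.

```python
def tokens_per_expert(batch_tokens: int, top_k: int, num_experts: int) -> float:
    """Average tokens routed to each expert in one step.

    Assumes uniform routing (an idealization): every token picks top_k
    experts and the load spreads evenly over num_experts.
    """
    return batch_tokens * top_k / num_experts


def min_batch_for_target(target_per_expert: int, top_k: int, num_experts: int) -> int:
    """Smallest batch (in tokens) that gives each expert at least
    target_per_expert tokens on average, under the same assumption."""
    # Integer ceiling division without going through floats.
    return -((-target_per_expert * num_experts) // top_k)


if __name__ == "__main__":
    # Hypothetical MoE shape: 256 routed experts, 8 active per token.
    for batch in (512, 8192):
        avg = tokens_per_expert(batch, top_k=8, num_experts=256)
        print(f"batch={batch:>5} tokens -> {avg:.0f} tokens per expert")
    print(min_batch_for_target(128, top_k=8, num_experts=256))
```

Under these assumed numbers, a batch of 512 tokens leaves each expert with only 16 tokens on average, while reaching 128 tokens per expert takes a batch of 4,096 tokens. This is one way to read the claim that economical serving kicks in at fairly high batch sizes.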

Open Question

Agentic sessions complicate batching

The surrounding thread flags tension between high-throughput batch serving and the variable-latency demands of parallel agentic workloads, leaving open how 32-to-64-GPU setups will handle mixed traffic patterns.


Posts from X


Alexander Doria@Dorialexander

@willccbb I know (great post!). But wide-ep is requiring a large GPU stack?


will brown@willccbb

@Dorialexander as with all big MoEs, economical serving kicks in at fairly high batch size

32 or 64 gpus is common


Alexander Doria@Dorialexander

@willccbb Yeah that was the point I intended to develop in next post. As equipment lags, many cos have GPUs lying around but more in the 16-32 h100 range.


Florian Brand@xeophon

@Dorialexander @willccbb I wonder whether there’s an org which helps you sourcing larger clusters for your needs at great prices


will brown@willccbb

@Dorialexander guessing we get something close to glm-5.2-ish in the 400-500b range in the next few months? our results there were h200 fp8, should translate to something a bit smaller

tuning is fiddly but doable
