@Dorialexander pd disagg + wide-ep + kv offloading + cache-aware routing goes a long way
https://www.primeintellect.ai/blog/rl-at-1t-scale
Yeah, vllm/sglang do work quite well now. Batching ensures good theoretical throughput, but then everyone wants to run Claude Code and you have to manage X parallel sessions, each with varying latency.


