🚀 New blog: Heterogeneous CPU + GPU EPD Disaggregation to Boost VLM Serving, with Intel Xeon CPUs taking over vision encoding to cut TTFT and boost throughput.
Vision encoding is the bottleneck in image-heavy VLM serving. Offloading it to CPUs changes that. Combining SGLang EPD disaggregation, the Dynamo device-aware weighted router, and @Intel AMX on Xeon 6747P, we achieved:
- 1.2-1.3× lower P99 TTFT and higher request throughput
- 1.3-30× lower P99 TPOT
- Extra ROI on top of pure GPU EPD disaggregation, at near-zero added cost
Thanks to @inteldevs for the collaboration on this!