Anyscale releases Ray 2.56 with Ray Serve optimizations that match Rust-based routers and boost throughput up to 24.8x

Original post

Super excited about the launch of these new performance optimizations built on Ray and vLLM. This is a major milestone for the next-generation open-source AI infrastructure stack.

Seiji Eicher@seiji_________

Today we are excited to announce, in partnership with the GKE team at Google Cloud (@googlecloud), a major milestone in Ray Serve LLM’s production serving capability. Ray Serve LLM now matches high-performance, Rust-based routing frameworks such as vllm-router (@vllm_project) in benchmarks across a variety of workloads and deployment patterns.

In Ray 2.56, we see up to 4x higher request throughput on prefill-heavy workloads, and 24x higher request throughput on decode-heavy workloads 🎉

9:36 AM · Jun 18, 2026 · 1.9K Views

Robert Nishihara@robertnishihara

Ray + vLLM is faster now

ray@raydistributed

Ray Serve LLM now offers 4.4x higher request throughput on prefill-heavy workloads, and 24.8x higher request throughput on decode-heavy workloads!

🚀 Three major optimizations:
- Direct streaming, bypassing an intermediate Ray Serve deployment on the response path with a new, control plane-only endpoint picker
- A new Ray V2 executor backend in vLLM, enabling optimizations such as async scheduling
- HAProxy ingress, for ingress request routing at the speed of C

All available in Ray 2.56. This is awesome work with @googlecloud and @vllm_project!

vLLM@vllm_project

Huge milestone from the @anyscalecompute + @googlecloud GKE teams 🎊

Ray Serve LLM provides up to 4.4x higher throughput on prefill-heavy workloads and 24x on decode-heavy workloads than previous versions.

Three optimizations made this possible on the Ray Serve LLM + vLLM stack:
⭐️ Direct streaming with a control-plane-only endpoint picker
⭐️ A new vLLM Ray V2 executor backend
⭐️ HAProxy ingress for routing at the speed of C

Ray's primitives for fault tolerance, observability, and portability across K8s and VMs are a great foundation as inference deployments get more complex.

Congrats to the team! Try the new Ray V2 executor today in vLLM with --distributed-executor-backend ray.
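
The flag mentioned above can be passed to vLLM's `vllm serve` command. The following is only a minimal sketch of how such a launch command might be assembled from Python. The model name `my-org/my-model` and the helper `build_vllm_serve_command` are hypothetical placeholders, the tensor-parallel size is an arbitrary choice, and the sketch only prints the command rather than starting a server.

```python
from __future__ import annotations

import shlex


def build_vllm_serve_command(model: str, tensor_parallel_size: int = 1) -> list[str]:
    """Assemble a `vllm serve` command that selects the Ray executor backend.

    `model` is a placeholder; substitute a model you actually have access to.
    """
    return [
        "vllm", "serve", model,
        # Flag quoted in the post above: use Ray as the distributed executor.
        "--distributed-executor-backend", "ray",
        # Assumption for illustration: shard the model across this many GPUs.
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


# Hypothetical model name, used only for illustration.
cmd = build_vllm_serve_command("my-org/my-model", tensor_parallel_size=2)

# Print a shell-safe version of the command; pass `cmd` to subprocess.run
# if you want to launch it on a machine with vLLM and Ray installed.
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems if the model path contains spaces or special characters.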

Anyscale@anyscalecompute

Ray Serve LLM hits a major milestone: up to 4.4x higher throughput on prefill-heavy & 24.8x on decode-heavy workloads vs. baseline, now matching Rust-based vllm-router while keeping Ray's fault tolerance & portability.

How we did it in partnership with @Google: https://na2.hubs.ly/H069hh-0

Simon Mo@simon_mo_

Great work! Amazing to see Ray Serve LLM and @vllm_project growing ever closer together! When done right, @raydistributed is ever flexible, extensible, and highly performant.

PyTorch@PyTorch

Ray (@raydistributed) Serve LLM and @vllm_project enable high-performance distributed inference at scale. Awesome to see Foundation-hosted projects working together to advance the open-source AI stack.

Learn more: https://www.anyscale.com/blog/high-performance-distributed-inference-ray-serve-llm-vllm-google-kubernetes-gke

Seiji Eicher@seiji_________

@simon_mo_ @vllm_project @raydistributed Thank you @simon_mo_!

Seiji Eicher@seiji_________

@istoica05 Thank you, @istoica05!

Sharoon Irfan@sharoon_irfan

@anyscalecompute @GoogleCloudTech @Google 24.8x on decode-heavy workloads is a significant number. Would love to see how it holds on mixed batches with variable sequence lengths, since that's what real production traffic actually looks like rather than clean benchmarks.