Ray co-founder Robert Nishihara says Slurm and Ray are complementary after Midjourney's David Holz asks whether developers still use Slurm
Slurm handles resource scheduling, while Ray manages distributed runtimes.
@DavidSHolz why do people compare ray and slurm ever
does everyone in AI still use SLURM? or have we moved to RAY? what's going on in cloud orchestration land nowadays?
FWIW, I don't view Ray and Slurm as alternatives to each other, I think of them as solving different problems, e.g.,
Slurm is responsible for sharing compute resources among multiple workloads and multiple users. It provides workload multitenancy, queuing, prioritization, preemption, etc.
Ray is an actor framework and provides a distributed runtime for a single workload. It provides a single-controller programming model for distributed workloads, manages & coordinates processes, handles failures, etc.
It's very natural to run a Ray workload on top of Slurm, similar to how you'd run a Ray workload on top of Kubernetes.
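To make the layering concrete, here is a toy sketch in plain Python (standard library only). It uses neither Slurm's nor Ray's real APIs; `ToyScheduler`, `Job`, and `sum_of_squares` are hypothetical names invented for illustration. The scheduler plays the Slurm-like role of queuing and prioritizing jobs from several users, while each job internally plays the Ray-like role of a single controller fanning work out to its own workers.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass(order=True)
class Job:
    priority: int                                  # lower value runs first
    seq: int                                       # tie-breaker: submission order
    user: str = field(compare=False)
    workers: int = field(compare=False)
    run: Callable[[int], object] = field(compare=False)


class ToyScheduler:
    """Slurm-like role (hypothetical): share one cluster among many users' jobs."""

    def __init__(self) -> None:
        self._queue: List[Job] = []
        self._seq = 0

    def submit(self, user: str, priority: int, workers: int,
               run: Callable[[int], object]) -> None:
        heapq.heappush(self._queue, Job(priority, self._seq, user, workers, run))
        self._seq += 1

    def run_all(self) -> List[Tuple[str, object]]:
        # Jobs run one at a time here; a real scheduler would also pack
        # jobs onto nodes, preempt, enforce quotas, and so on.
        results = []
        while self._queue:
            job = heapq.heappop(self._queue)
            results.append((job.user, job.run(job.workers)))
        return results


def sum_of_squares(workers: int) -> int:
    """Ray-like role (hypothetical): one controller coordinates workers within a job."""
    # Split 0..99 into `workers` interleaved chunks.
    chunks = [range(i, 100, workers) for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda chunk: sum(x * x for x in chunk), chunks)
        return sum(partials)


if __name__ == "__main__":
    sched = ToyScheduler()
    sched.submit("alice", priority=2, workers=4, run=sum_of_squares)
    sched.submit("bob", priority=1, workers=2, run=sum_of_squares)
    print(sched.run_all())
```

In this sketch the scheduler never looks inside a job, and a job never decides when it gets to run. That separation is the point of the comparison: the two layers stack rather than compete.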
@DavidSHolz I've written a bit about how I think about the layering. https://www.anyscale.com/blog/ai-compute-open-source-stack-kubernetes-ray-pytorch-vllm
But to your original question, we see more Kubernetes (versus Slurm), but both are extremely popular. More specifically:
- Established tech companies have largely standardized on Kubernetes
- AI startups are split between Slurm and Kubernetes
- They often eventually shut down the Slurm clusters and move to Kubernetes, but this is a very slow process
- For batch jobs (training / data prep), research teams often prefer the Slurm developer experience versus Kubernetes
- For running production inference services, Kubernetes is much better
@DavidSHolz torchx with k8s is nice
@DavidSHolz *nice enough
@DavidSHolz Slurm will always have a special place in my heart, probably not a first choice after a certain scale of both compute and number of people using it