Emad Mostaque says a single researcher can now manage several thousand GPUs for AI training with modern automation
Historically, one researcher managed 1,000 A100 GPUs.
@DavidSHolz A few years ago we had 1000+ GPUs (so 128 nodes) fine on a100s by a single researcher
With h and b chips and new failover you should be able to get to thousands comfortable
On TPUs even more i would think
how many gpus do you think a single researcher can handle at once for a single big training job without any help? (assume it's set up with slurm or something)
@EMostaque @DavidSHolz I once used ~90% of the cluster to tokenize videos, before getting kicked for hogging preemptible compute
@DavidSHolz A few years ago we had 1000+ GPUs (so 128 nodes) fine on a100s by a single researcher With h and b chips and new failover you should be able to get to thousands comfortable On TPUs even more i would think