Emad Mostaque says a single researcher can now manage several thousand GPUs for AI training with modern automation
He says that a few years ago a single researcher ran more than 1,000 A100 GPUs (128 nodes) without trouble.
@unixpickle yah I mean can a single researcher with agents really do a 100k GPU training job?
@DavidSHolz If they have coding agents, probably unbounded
@DavidSHolz A few years ago we had 1000+ GPUs (so 128 nodes) fine on a100s by a single researcher
With h and b chips and new failover you should be able to get to thousands comfortable
On TPUs even more i would think
how many gpus do you think a single researcher can handle at once for a single big training job without any help? (assume it's set up with slurm or something)
@Clashluke @DavidSHolz It was a good cluster, sadly missed
(well aside from the interconnect ofc, but still)
@EMostaque @DavidSHolz I once used ~90% of the cluster to tokenize videos, before getting kicked for hogging preemptible compute
@DavidSHolz I was able to do a few k gpus at once solo, which I found mostly stressful but very empowering