13h ago

Emad Mostaque says a single researcher can now manage several thousand GPUs for AI training with modern automation

Historically, one researcher managed 1,000 A100 GPUs.

3517034433.7K

——0——

Original post

how many gpus do you think a single researcher can handle at once for a single big training job without any help? (assume it's set up with slurm or something)

12:34 AM · May 24, 2026

#156David@DAVIDSHOLZ

@unixpickle yah I mean can a single researcher with agents really do a 100k GPU training job?

Alex Nichol@unixpickle

@DavidSHolz If they have coding agents, probably unbounded

6:57 PM · May 24, 2026 · 125 Views

7:34 PM · May 24, 2026 · 69 Views

#167Emad@EMOSTAQUE

@DavidSHolz A few years ago we had 1000+ GPUs (so 128 nodes) fine on a100s by a single researcher

With h and b chips and new failover you should be able to get to thousands comfortable

On TPUs even more i would think

David@DavidSHolz

how many gpus do you think a single researcher can handle at once for a single big training job without any help? (assume it's set up with slurm or something)

7:34 AM · May 24, 2026 · 31K Views

1:21 PM · May 24, 2026 · 2.4K Views

#167Emad@EMOSTAQUE

@Clashluke @DavidSHolz It was a good cluster, sadly missed

(well aside from the interconnect ofc, but still)

Lucas Nestler@Clashluke

@EMostaque @DavidSHolz I once used ~90% of the cluster to tokenize videos, before getting kicked for hogging preemptible compute

2:00 PM · May 24, 2026 · 122 Views

8:49 PM · May 24, 2026 · 5 Views

#263Andrew Carr 🤸@ANDREW_N_CARR

@DavidSHolz I was able to do a few k gpus at once solo, which I found mostly stressful but very empowering

David@DavidSHolz

how many gpus do you think a single researcher can handle at once for a single big training job without any help? (assume it's set up with slurm or something)

7:34 AM · May 24, 2026 · 31K Views

4:46 PM · May 24, 2026 · 314 Views

#565Alex Nichol@UNIXPICKLE

@DavidSHolz If they have coding agents, probably unbounded

David@DavidSHolz

how many gpus do you think a single researcher can handle at once for a single big training job without any help? (assume it's set up with slurm or something)

7:34 AM · May 24, 2026 · 31K Views

6:57 PM · May 24, 2026 · 125 Views

#1709Lucas Nestler@CLASHLUKE

@EMostaque @DavidSHolz I once used ~90% of the cluster to tokenize videos, before getting kicked for hogging preemptible compute

Emad@EMostaque

@DavidSHolz A few years ago we had 1000+ GPUs (so 128 nodes) fine on a100s by a single researcher With h and b chips and new failover you should be able to get to thousands comfortable On TPUs even more i would think

1:21 PM · May 24, 2026 · 2.4K Views

2:00 PM · May 24, 2026 · 122 Views

Emad Mostaque says a single researcher can now manage several thousand GPUs for AI training with modern automation

Cluster engagement

Sentiment