1d ago

Emad Mostaque says a single researcher should now be able to run an AI training job on thousands of GPUs, given newer chips and better failover

A few years ago, by his account, a single researcher ran jobs on 1,000+ A100 GPUs (128 nodes).

Original post

how many gpus do you think a single researcher can handle at once for a single big training job without any help? (assume it's set up with slurm or something)

12:34 AM · May 24, 2026

#156 David @DavidSHolz

@unixpickle yah I mean can a single researcher with agents really do a 100k GPU training job?

Alex Nichol @unixpickle

@DavidSHolz If they have coding agents, probably unbounded

6:57 PM · May 24, 2026 · 448 Views

7:34 PM · May 24, 2026 · 339 Views

#167 Emad @EMostaque

@DavidSHolz A few years ago we had 1000+ GPUs (so 128 nodes) fine on a100s by a single researcher

With h and b chips and new failover you should be able to get to thousands comfortable

On TPUs even more i would think

David @DavidSHolz

how many gpus do you think a single researcher can handle at once for a single big training job without any help? (assume it's set up with slurm or something)

7:34 AM · May 24, 2026 · 38.3K Views

1:21 PM · May 24, 2026 · 3.6K Views

#167 Emad @EMostaque

@Clashluke @DavidSHolz It was a good cluster, sadly missed

(well aside from the interconnect ofc, but still)

Lucas Nestler @Clashluke

@EMostaque @DavidSHolz I once used ~90% of the cluster to tokenize videos, before getting kicked for hogging preemptible compute

2:00 PM · May 24, 2026 · 181 Views

8:49 PM · May 24, 2026 · 74 Views

#263 Andrew Carr 🤸 @ANDREW_N_CARR

@DavidSHolz I was able to do a few k gpus at once solo, which I found mostly stressful but very empowering

4:46 PM · May 24, 2026 · 503 Views
