1d ago

Emad Mostaque says a single researcher can now run thousands of GPUs for AI training, thanks to newer chips and better failover

A few years ago, he says, a single researcher handled 1,000+ A100 GPUs (128 nodes).

Original post

how many gpus do you think a single researcher can handle at once for a single big training job without any help? (assume it's set up with slurm or something)

12:34 AM · May 24, 2026

@unixpickle yah I mean can a single researcher with agents really do a 100k GPU training job?

Alex Nichol @unixpickle

@DavidSHolz If they have coding agents, probably unbounded

6:57 PM · May 24, 2026 · 448 Views
7:34 PM · May 24, 2026 · 339 Views

@DavidSHolz A few years ago we had 1000+ GPUs (so 128 nodes) fine on a100s by a single researcher

With h and b chips and new failover you should be able to get to thousands comfortable

On TPUs even more i would think

David @DavidSHolz

how many gpus do you think a single researcher can handle at once for a single big training job without any help? (assume it's set up with slurm or something)

7:34 AM · May 24, 2026 · 38.3K Views
1:21 PM · May 24, 2026 · 3.6K Views

@Clashluke @DavidSHolz It was a good cluster, sadly missed

(well aside from the interconnect ofc, but still)

Lucas Nestler @Clashluke

@EMostaque @DavidSHolz I once used ~90% of the cluster to tokenize videos, before getting kicked for hogging preemptible compute

2:00 PM · May 24, 2026 · 181 Views
8:49 PM · May 24, 2026 · 74 Views

@DavidSHolz I was able to do a few k gpus at once solo, which I found mostly stressful but very empowering

4:46 PM · May 24, 2026 · 503 Views

@EMostaque @DavidSHolz I once used ~90% of the cluster to tokenize videos, before getting kicked for hogging preemptible compute

Emad @EMostaque

@DavidSHolz A few years ago we had 1000+ GPUs (so 128 nodes) fine on a100s by a single researcher With h and b chips and new failover you should be able to get to thousands comfortable On TPUs even more i would think

1:21 PM · May 24, 2026 · 3.6K Views
2:00 PM · May 24, 2026 · 181 Views