13h ago

Emad Mostaque says a single researcher should now be able to comfortably run thousands of GPUs for AI training, citing newer H- and B-series chips and improved failover

A few years ago, by his account, a single researcher ran 1,000+ A100 GPUs (128 nodes) without trouble.

Original post

how many gpus do you think a single researcher can handle at once for a single big training job without any help? (assume it's set up with slurm or something)

12:34 AM · May 24, 2026

@unixpickle yah I mean can a single researcher with agents really do a 100k GPU training job?

Alex Nichol @unixpickle

@DavidSHolz If they have coding agents, probably unbounded

6:57 PM · May 24, 2026 · 125 Views
7:34 PM · May 24, 2026 · 69 Views

@DavidSHolz A few years ago we had 1000+ GPUs (so 128 nodes) fine on a100s by a single researcher

With h and b chips and new failover you should be able to get to thousands comfortable

On TPUs even more i would think

David @DavidSHolz

how many gpus do you think a single researcher can handle at once for a single big training job without any help? (assume it's set up with slurm or something)

7:34 AM · May 24, 2026 · 31K Views
1:21 PM · May 24, 2026 · 2.4K Views
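
For a sense of scale, the "1000+ GPUs (so 128 nodes)" figure in the reply above is consistent with 8 GPUs per node. That per-node count is an assumption here (it is a common A100 server layout, but the thread does not state it). A rough sketch of the arithmetic, with hypothetical helper names, applied to both that cluster and the 100k-GPU job raised elsewhere in the thread:

```python
import math

# Assumption: 8 GPUs per node. The thread only says "1000+ GPUs (so 128 nodes)".
GPUS_PER_NODE = 8


def gpus_in(nodes: int, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Total GPUs across a given number of identical nodes."""
    return nodes * gpus_per_node


def nodes_needed(total_gpus: int, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Whole nodes required to host total_gpus (rounded up)."""
    return math.ceil(total_gpus / gpus_per_node)


print(gpus_in(128))           # 1024
print(nodes_needed(100_000))  # 12500
```

Under the same assumption, a 100k-GPU job spans roughly a hundred times as many nodes as the A100 cluster described, which is why failover comes up in the reply.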

@Clashluke @DavidSHolz It was a good cluster, sadly missed

(well aside from the interconnect ofc, but still)

Lucas Nestler @Clashluke

@EMostaque @DavidSHolz I once used ~90% of the cluster to tokenize videos, before getting kicked for hogging preemptible compute

2:00 PM · May 24, 2026 · 122 Views
8:49 PM · May 24, 2026 · 5 Views

@DavidSHolz I was able to do a few k gpus at once solo, which I found mostly stressful but very empowering

4:46 PM · May 24, 2026 · 314 Views

