Tech · 37d ago

Emad Mostaque says a single researcher can now manage several thousand GPUs for AI training with modern automation

A few years ago, he says, a single researcher handled a run on 1,000+ A100 GPUs (128 nodes).


Original post

David@DavidSHolz

how many gpus do you think a single researcher can handle at once for a single big training job without any help? (assume it's set up with slurm or something)

12:34 AM · May 24, 2026 · 38.3K Views

Sentiment

Many users say, based on their own experience, that one researcher can manage thousands of GPUs alone, because recent hardware such as H100/B200 chips and better automation make large solo runs straightforward.

Pos: 100.0% · Neg: 0.0%

6 comments with sentiment.


Posts from X

Most Activity

Views 3.8K · Bookmarks 3 · Likes 18 · Retweets 1

Emad@EMostaque

@DavidSHolz A few years ago we had 1000+ GPUs (so 128 nodes) fine on a100s by a single researcher

With h and b chips and new failover you should be able to get to thousands comfortably

On TPUs even more i would think


snow@snowclipsed

I've heard ~400-500 nodes. depends on what you're training. naively thinking, the more redundancy you have, the more nodes you can just exclude, say if you hit an XID error that can only be resolved on the datacenter/compute provider side and you don't have cycling authority.


David@DavidSHolz

@snowclipsed what kind of stuff do you hear people running solo on those kind of node counts?


snow@snowclipsed

@DavidSHolz very confidently can say RL


outside five sigma@jwt0625

@DavidSHolz ive heard only 20 ish people in the whole bay area can bring a whole GB300 rack thru L11 validation, so it depends on whether you count fixing hardware/firmware issues or not


Alex UGift@Radipdegen

@DavidSHolz if the infra works 8-12 easy. if anything breaks before lunch maybe 2 lol


David@DavidSHolz

@sensho I think LLM/agents is fine!


Lumin@luminxbt

@DavidSHolz it depends if u count the hours they spend debugging slurm configs as part of the answer

id say 4-8 before they burn out


Alex Nichol@unixpickle

@DavidSHolz If they have coding agents, probably unbounded


sensho@sensho

@DavidSHolz does without any help assume no use of llms/agents? researcher wielding it well prolly changes some stuff


Trevor Blackwell@tlbtlbtlb

@DavidSHolz All of them.


Sam Foreman@saforem2

@DavidSHolz why would there be an upper limit


Lucas Nestler@Clashluke

@EMostaque @DavidSHolz I once used ~90% of the cluster to tokenize videos, before getting kicked for hogging preemptible compute


DHD@DHDev0

@DavidSHolz I don't think there is a limit. I used to manage 8000+ nodes (4x MI250X each) on a daily basis. You just need the cluster to be correctly set up and the algo to scale on it (as little between-node communication as possible).


Hensen Juang@basedjensen

@DavidSHolz People should be able to solo 500 to 1000 nodes if the infra is set up properly. With shitty infra, maybe 100 nodes is the max


Andrew Carr 🤸@andrew_n_carr

@DavidSHolz I was able to do a few k gpus at once solo, which I found mostly stressful but very empowering


snow@snowclipsed

@DavidSHolz (which makes a lot of it PD disagg inference)


David@DavidSHolz

@unixpickle yah I mean can a single researcher with agents really do a 100k GPU training job?


Lee Penkman@LeeLeepenkman

maybe 100 or something its pretty easy yea depends how automated it is i did some big runs on aws sagemaker... very easy... would do it myself these days torch distributed... just manually AI manage them all as pets idk runpod or smth lol... depends how much management yea but with AI agents you can get heaps of automation over it or some kind of rules around it like to save time on the gpus can help like early out stop different training runs if they arent looking promising or yea scale it to zero when its not training... save topk checkpoints... detect grad explosion and restart from latest checkpoint etc... yea just over automate over test and should be fine :) can work your way up from a small amount to more and more as you get more and more confidence in the testing.

i think there will be one person, 1B+ company, one person labs one day doing huge experiments and frontier training runs with huge AI coding agent runs too
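A couple of the safeguards listed in the post above (saving top-k checkpoints, detecting a gradient explosion and restarting from the latest checkpoint) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not anyone's actual tooling: `RunGuard`, `supervise`, the 10x explosion threshold, and the EMA decay are all hypothetical, and the loop only records restart events instead of reloading real model state.

```python
import heapq
import math
from typing import List, Optional, Sequence, Tuple


class RunGuard:
    """Hypothetical helper: flags gradient explosions and keeps the top-k checkpoints."""

    def __init__(self, top_k: int = 3, explode_factor: float = 10.0, ema_decay: float = 0.9):
        self.top_k = top_k
        self.explode_factor = explode_factor  # assumed threshold; tune per run
        self.ema_decay = ema_decay
        self.grad_ema: Optional[float] = None
        self._heap: List[Tuple[float, int]] = []  # (-loss, step); root is the worst kept

    def is_exploding(self, grad_norm: float) -> bool:
        # NaN/inf norms always count as an explosion.
        if not math.isfinite(grad_norm):
            return True
        if self.grad_ema is None:
            self.grad_ema = grad_norm
            return False
        if grad_norm > self.explode_factor * self.grad_ema:
            return True  # do not fold the spike into the running average
        self.grad_ema = self.ema_decay * self.grad_ema + (1 - self.ema_decay) * grad_norm
        return False

    def record_checkpoint(self, step: int, loss: float) -> None:
        # Keep only the top_k lowest-loss checkpoints.
        heapq.heappush(self._heap, (-loss, step))
        if len(self._heap) > self.top_k:
            heapq.heappop(self._heap)  # drops the highest-loss entry

    def best_checkpoints(self) -> List[Tuple[float, int]]:
        # (loss, step) pairs, best first.
        return sorted((-neg_loss, step) for neg_loss, step in self._heap)


def supervise(grad_norms: Sequence[float], losses: Sequence[float],
              checkpoint_every: int = 2, guard: Optional[RunGuard] = None):
    """Replay per-step metrics and log where a restart would be triggered."""
    guard = guard or RunGuard()
    events = []
    latest_ckpt: Optional[int] = None
    for step, (grad_norm, loss) in enumerate(zip(grad_norms, losses)):
        if guard.is_exploding(grad_norm):
            # A real job would reload the checkpoint at latest_ckpt here.
            events.append(("restart", step, latest_ckpt))
            continue
        if step % checkpoint_every == 0:
            latest_ckpt = step
            guard.record_checkpoint(step, loss)
    return events, guard.best_checkpoints()
```

In a real run the metrics would come from the training loop and the restart would go through the scheduler or the job's own resume logic; the point here is only that each rule is a few lines and can be tested on recorded metrics before it is trusted on a large allocation.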


Adam Hibble@Algomancer

I don’t know the upper bounds but i have comfortably handled 16 nodes solo, more of a budget limitation - but I understand I’m more comfortable with operations than most so I think I’m mostly going to tap out on gpu faults. Imagine anything beyond that I’d probably need to think more about topology, but assuming no physical faults I can usually figure it out. I will say tho I’m perpetually compute poor so I’d hedge beyond that.

and tend to work entirely solo so don’t have much ‘other people overhead’ wrt job scheduling. Pref just to have root.
