Emad Mostaque says a single researcher can now manage several thousand GPUs for AI training with modern automation
A few years ago, by his account, a single researcher ran 1,000+ A100 GPUs (128 nodes).
Many users say, based on their own experience, that one researcher can manage thousands of GPUs alone, because recent hardware like H100/B200 chips and better automation make large solo runs straightforward.
@DavidSHolz A few years ago we had 1000+ GPUs (so 128 nodes) fine on a100s by a single researcher
With h and b chips and new failover you should be able to get to thousands comfortable
On TPUs even more i would think
how many gpus do you think a single researcher can handle at once for a single big training job without any help? (assume it's set up with slurm or something)

I've heard ~400-500 nodes. Depends on what you're training. Naively thinking, the more redundancy you have, the more you can just exclude nodes, say if you hit an XID error, assuming it can only be resolved on the datacenter/compute provider side and you don't have cycling authority.
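
The exclusion idea above can be sketched roughly as follows. This is only an illustration: the `node_xids` mapping, the node names, and the `build_exclude_arg` helper are hypothetical, and how Xid events get collected is left out entirely. It assumes Slurm's `--exclude` option for `sbatch`, which takes a comma-separated node list.

```python
def build_exclude_arg(node_xids):
    """Build an sbatch --exclude argument from nodes that reported Xid errors.

    node_xids: hypothetical mapping of node name -> list of Xid codes seen
    on that node (an empty list means the node looks healthy).
    Returns an empty string when there is nothing to exclude.
    """
    bad_nodes = sorted(node for node, xids in node_xids.items() if xids)
    if not bad_nodes:
        return ""
    return "--exclude=" + ",".join(bad_nodes)


if __name__ == "__main__":
    # Made-up example data: two nodes with Xid events, one clean.
    seen = {"node001": [79], "node002": [], "node003": [48]}
    print(build_exclude_arg(seen))
```

With enough spare nodes in the allocation, the resulting string could be passed along when resubmitting the job, so the run continues without waiting on the provider to fix the faulty hardware.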

@snowclipsed what kind of stuff do you hear people running solo on those kind of node counts?

@DavidSHolz very confidently can say RL

@DavidSHolz ive heard only 20 ish people in the whole bay area can bring a whole GB300 rack thru L11 validation, so it depends on whether you count fixing hardware/firmware issues or not

@DavidSHolz if the infra works 8-12 easy. if anything breaks before lunch maybe 2 lol

@sensho I think LLM/agents is fine!

@DavidSHolz it depends if u count the hours they spend debugging slurm configs as part of the answer
id say 4-8 before they burn out
@DavidSHolz If they have coding agents, probably unbounded
@DavidSHolz does without any help assume no use of llms/agents? researcher wielding it well prolly changes some stuff

@DavidSHolz All of them.

@DavidSHolz why would there be an upper limit
@EMostaque @DavidSHolz I once used ~90% of the cluster to tokenize videos, before getting kicked for hogging preemptible compute

@DavidSHolz I don't think there is a limit. I used to manage 8000+ nodes (4x MI250X each) on a daily basis. You just need the cluster to be set up correctly and the algo to scale on it (as little between-node communication as possible).

@DavidSHolz People should be able to solo 500 to 1000 nodes if the infra is set up properly. With shitty infra, maybe 100 nodes is the max
@DavidSHolz I was able to do a few k gpus at once solo, which I found mostly stressful but very empowering
@DavidSHolz (which makes a lot of it PD disagg inference)
@unixpickle yah I mean can a single researcher with agents really do a 100k GPU training job?

maybe 100 or something its pretty easy yea depends how automated it is i did some big runs on aws sagemaker... very easy... would do it myself these days torch distributed... just manually AI manage them all as pets idk runpod or smth lol... depends how much management yea but with AI agents you can get heaps of automation over it or some kind of rules around it like to save time on the gpus can help like early out stop different training runs if they arent looking promising or yea scale it to zero when its not training... save topk checkpoints... detect grad explosion and restart from latest checkpoint etc... yea just over automate over test and should be fine :) can work your way up from a small amount to more and more as you get more and more confidence in the testing.
i think there will be one person, 1B+ company, one person labs one day doing huge experiments and frontier training runs with huge AI coding agent runs too
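
A few of the rules mentioned above (keep the top-k checkpoints, detect a gradient explosion, stop a run early when it is not improving) could look something like this minimal sketch. Everything here is an assumption for illustration: the `RunGuard` class, its method names, and the default thresholds are hypothetical, and the actual saving, deleting, and reloading of checkpoints is left to the training loop.

```python
import heapq
import math


class RunGuard:
    """Toy rule set for an unattended training run (thresholds are made up)."""

    def __init__(self, top_k=3, max_grad_norm=100.0, patience=5):
        self.top_k = top_k
        self.max_grad_norm = max_grad_norm
        self.patience = patience
        # Min-heap of (-loss, step): the root is the worst checkpoint kept.
        self.best = []
        self.best_loss = math.inf
        self.stale_evals = 0

    def grad_exploded(self, grad_norm):
        """True if the gradient norm is NaN/inf or above the threshold."""
        return not math.isfinite(grad_norm) or grad_norm > self.max_grad_norm

    def record_eval(self, step, loss):
        """Record an eval result; return a step whose checkpoint can be deleted.

        Returns None while fewer than top_k + 1 checkpoints exist. The
        evicted step may be the one just recorded if it is the worst.
        """
        if loss < self.best_loss:
            self.best_loss = loss
            self.stale_evals = 0
        else:
            self.stale_evals += 1
        heapq.heappush(self.best, (-loss, step))
        if len(self.best) > self.top_k:
            _, evicted_step = heapq.heappop(self.best)
            return evicted_step
        return None

    def should_stop(self):
        """True once `patience` consecutive evals failed to improve."""
        return self.stale_evals >= self.patience

    def kept_steps(self):
        """Steps of the checkpoints currently retained, in ascending order."""
        return sorted(step for _, step in self.best)
```

In a training loop, `grad_exploded` returning true would be the signal to reload the latest checkpoint, and `should_stop` the signal to release the GPUs. Starting with small runs and tightening the thresholds as confidence grows matches the "work your way up" advice in the post.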

I don’t know the upper bounds, but I have comfortably handled 16 nodes solo. That was more of a budget limitation. I understand I’m more comfortable with operations than most, so I think I’d mostly tap out on GPU faults. For anything beyond that I imagine I’d need to think more about topology, but assuming no physical faults I can usually figure it out. I will say tho I’m perpetually compute poor, so I’d hedge beyond that.
I also tend to work entirely solo, so I don’t have much ‘other people overhead’ wrt job scheduling. Pref just to have root.