How do you usually figure out why a multi-GPU training run is slower than expected?
I have been bitten by this a few times recently and realized everyone seems to have a slightly different workflow.
Thinking about the last time a multi-GPU (DDP / FSDP) training run was noticeably slower than you expected:
What did you suspect first?
How did you narrow it down?
Did it end up being data loading, comms, load imbalance, or something else?
Roughly how long did it take before you felt confident about the root cause?
Genuinely curious how people debug this in practice, because my own process still feels pretty ad-hoc.
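For context, the kind of ad-hoc thing I mean is roughly the sketch below: a crude per-step split between time spent blocked on the dataloader and time spent in forward/backward (which, under DDP, also includes the gradient all-reduce). This is just an illustration, not a recommended workflow; `model`, `loader`, and `optimizer` are placeholders for whatever you already have.

```python
# Crude per-step timing: dataloader wait vs. forward/backward (+ DDP allreduce).
import time
import torch

def timed_training_loop(model, loader, optimizer, device, log_every=50):
    data_t, step_t = 0.0, 0.0
    end = time.perf_counter()
    for i, (x, y) in enumerate(loader):
        data_t += time.perf_counter() - end           # time blocked waiting on the dataloader

        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        start = torch.cuda.Event(enable_timing=True)
        stop = torch.cuda.Event(enable_timing=True)
        start.record()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()                               # under DDP, gradient allreduce overlaps here
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        stop.record()
        torch.cuda.synchronize()                      # make elapsed_time valid
        step_t += start.elapsed_time(stop) / 1000.0   # CUDA events report milliseconds

        if (i + 1) % log_every == 0:
            rank = torch.distributed.get_rank() if torch.distributed.is_initialized() else 0
            print(f"rank {rank}: data {data_t:.2f}s  compute+comm {step_t:.2f}s "
                  f"over last {log_every} steps")
            data_t, step_t = 0.0, 0.0
        end = time.perf_counter()
```

If the data bucket dominates I start with the input pipeline; if compute+comm varies a lot across ranks I start suspecting imbalance or comms. Beyond that I fall back to torch.profiler traces, which is where it gets ad-hoc for me.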