How do you usually figure out why a multi-GPU training run is slower than expected?
I have been bitten by this a few times recently and realized everyone seems to have a slightly different workflow.
Thinking about the last time a multi-GPU (DDP / FSDP) training run was noticeably slower than you expected:
What did you suspect first?
How did you narrow it down?
Did it end up being data loading, comms, load imbalance, or something else?
Roughly how long did it take before you felt confident about the root cause?
Genuinely curious how people debug this in practice, because my own process still feels pretty ad-hoc.
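For context, the kind of ad-hoc thing I mean is roughly the sketch below: a crude per-step split between time spent blocked on the dataloader and time spent in forward/backward (which, under DDP, also includes the gradient all-reduce). This is just an illustration, not a recommended workflow; `model`, `loader`, and `optimizer` are placeholders for whatever you already have.

```python
# Crude per-step timing: dataloader wait vs. forward/backward (+ DDP allreduce).
import time
import torch

def timed_training_loop(model, loader, optimizer, device, log_every=50):
    data_t, step_t = 0.0, 0.0
    end = time.perf_counter()
    for i, (x, y) in enumerate(loader):
        data_t += time.perf_counter() - end           # time blocked waiting on the dataloader

        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        start = torch.cuda.Event(enable_timing=True)
        stop = torch.cuda.Event(enable_timing=True)
        start.record()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()                               # under DDP, gradient allreduce overlaps here
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        stop.record()
        torch.cuda.synchronize()                      # make elapsed_time valid
        step_t += start.elapsed_time(stop) / 1000.0   # CUDA events report milliseconds

        if (i + 1) % log_every == 0:
            rank = torch.distributed.get_rank() if torch.distributed.is_initialized() else 0
            print(f"rank {rank}: data {data_t:.2f}s  compute+comm {step_t:.2f}s "
                  f"over last {log_every} steps")
            data_t, step_t = 0.0, 0.0
        end = time.perf_counter()
```

If the data bucket dominates I start with the input pipeline; if compute+comm varies a lot across ranks I start suspecting imbalance or comms. Beyond that I fall back to torch.profiler traces, which is where it gets ad-hoc for me.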