MIT Lecture Explains Massively Parallel Deep Learning Training Techniques
One thing I would add is that using a single DPxTPxPPxEP mesh limits the degree of EP, because the world size is the product of all four degrees. For example, EP=64 basically explodes world size.
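To make the arithmetic concrete, here is a minimal sketch. The degrees (DP=8, TP=8, PP=4) are made-up numbers for illustration, and `world_size` is a hypothetical helper, not anything from a real framework.

```python
from math import prod


def world_size(**degrees: int) -> int:
    """Number of ranks needed when every parallelism degree is its own mesh axis."""
    return prod(degrees.values())


# Hypothetical baseline: dense model, no expert parallelism.
base = world_size(dp=8, tp=8, pp=4)

# Same setup with EP added as a separate, fourth mesh dimension.
with_ep = world_size(dp=8, tp=8, pp=4, ep=64)

print(base, with_ep)  # 256 16384
```

So with EP as an independent axis, the GPU count is multiplied by the full EP degree.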
One common solution is to repurpose the DP ranks as EP so there is no need for a separate EP dimension. The problem is that now EP=DP, which limits scalability.
One thing Megatron does is something called Parallel Folding, where you have two different meshes, one for Attention (DPxTPxCPxPP) and one for MoE (EPxETPxEDPxPP). This allows a high EP regardless of DP, with ETP set separately (e.g. ETP=1).
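A rough sketch of the bookkeeping behind folding, under the assumption that both meshes cover the same ranks and share the PP dimension, so TPxCPxDP must equal ETPxEPxEDP within each pipeline stage. `folded_moe_mesh` is a hypothetical helper for illustration, not Megatron's API, and the degrees are made up.

```python
def folded_moe_mesh(tp: int, cp: int, dp: int, pp: int, ep: int, etp: int = 1) -> dict:
    """Derive the MoE mesh (EP x ETP x EDP x PP) from the attention mesh (DP x TP x CP x PP).

    Assumption: both meshes reuse the same ranks, so within one pipeline
    stage tp * cp * dp == etp * ep * edp.
    """
    ranks_per_stage = tp * cp * dp
    if ranks_per_stage % (ep * etp) != 0:
        raise ValueError("ep * etp must divide tp * cp * dp")
    edp = ranks_per_stage // (ep * etp)  # expert data parallel degree
    return {"ep": ep, "etp": etp, "edp": edp, "pp": pp}


# Attention uses DP=8, TP=8, CP=1, PP=4, i.e. 256 ranks in total.
moe = folded_moe_mesh(tp=8, cp=1, dp=8, pp=4, ep=64)
print(moe)  # {'ep': 64, 'etp': 1, 'edp': 1, 'pp': 4}
```

In this example EP=64 fits in the same 256 ranks even though DP is only 8, because the TP ranks are folded into the expert dimension.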
Great talk on distributed training. It nicely covers all the basics.