
MIT Lecture Explains Massively Parallel Deep Learning Training Techniques

wh @nrehiew_

Great talk on distributed training, nicely covers all the basics

7:47 AM · Jun 2, 2026 · 5.5K Views
wh @nrehiew_

One thing I would add is that using a single DPxTPxPPxEP mesh, with EP as its own dimension, limits the degree of EP. For example, EP=64 multiplies the world size by 64.
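To make the world-size point concrete, here is a minimal arithmetic sketch. The helper `world_size` and the specific degrees (DP=2, TP=8, PP=4) are illustrative assumptions, not values from the talk or from any library.

```python
from math import prod


def world_size(**degrees: int) -> int:
    """Ranks required when every parallelism degree is its own mesh dimension.

    Hypothetical helper for illustration: the world size of a device mesh
    is the product of its dimension sizes.
    """
    return prod(degrees.values())


# Assumed example degrees, not taken from the post.
dense = world_size(dp=2, tp=8, pp=4)
with_ep = world_size(dp=2, tp=8, pp=4, ep=64)

print(dense, with_ep)
```

Under these assumed degrees, adding EP=64 as a separate dimension takes the job from 64 ranks to 4096.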

A common solution is to repurpose the DP ranks as EP, so there is no need for a separate EP dimension. The problem is that EP is now tied to DP (EP=DP), which limits scalability.

One thing Megatron does is something called Parallel Folding, where you have two different meshes: one for Attention (DPxTPxCPxPP) and one for MoE (EPxETPxEDPxPP). This allows a high EP degree independent of DP, with a separate ETP=1 for the experts.
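The two-mesh idea can be sketched as plain arithmetic: both meshes cover the same set of ranks, so the expert data-parallel degree (EDP) falls out of the other degrees. This is only a sketch under the assumption that PP is shared by both meshes, as the two layouts above suggest. `folded_edp` is a hypothetical helper, not Megatron's API, and the example degrees are made up.

```python
def folded_edp(tp: int, cp: int, dp: int, pp: int, ep: int, etp: int = 1) -> int:
    """Return the EDP degree implied by folding the MoE mesh onto the attention mesh.

    Attention mesh: DP x TP x CP x PP. MoE mesh: EP x ETP x EDP x PP.
    Both must contain the same number of ranks, and PP is assumed shared,
    so within one pipeline stage: tp * cp * dp == ep * etp * edp.
    """
    ranks_per_stage = tp * cp * dp
    moe_group = ep * etp
    if ranks_per_stage % moe_group != 0:
        raise ValueError(
            f"EP*ETP={moe_group} does not divide the {ranks_per_stage} ranks per stage"
        )
    return ranks_per_stage // moe_group


# Assumed example: 512 ranks laid out as DP=16, TP=8, CP=1, PP=4.
edp = folded_edp(tp=8, cp=1, dp=16, pp=4, ep=64, etp=1)
print(edp)
```

In this assumed layout EP=64 exceeds DP=16 without growing the world size, because the expert mesh reuses the ranks that attention spends on TP.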
