/AI · 12h ago

MIT Lecture Explains Massively Parallel Deep Learning Training Techniques

Original post

wh (@nrehiew_) · #1430 in AI

Great talk on distributed training, nicely covers all the basics

7:47 AM · Jun 2, 2026 · 5.5K Views

Sentiment

Users criticized Megatron's Parallel Folding for MoE training as overly complex, describing scaling compute meshes as headaches and the acronyms as confusing makeshift band-aids.

Positive: 0.0%

Negative: 100.0%

2 comments with sentiment.

Posts from X

Most Activity

Views 2.7K · Bookmarks 20 · Likes 27 · Retweets 5 · Replies 1

wh (@nrehiew_)

1 thing I would add is that doing a single DPxTPxPPxEP mesh limits the degree of EP, since every dimension multiplies the world size. For example, EP=64 basically explodes world size.

1 common solution is to have the DP ranks repurposed as EP so there is no need for a separate EP dimension. The problem is now that EP=DP, which limits scalability.

1 thing Megatron does is something called Parallel Folding, where you have 2 different meshes, one for Attention (DPxTPxCPxPP) and one for MoE (EPxETPxEDPxPP), which allows for high EP regardless of DP, with a separate ETP=1.
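To make the arithmetic in the post concrete, here is a rough sketch of how the world size works out in the single-mesh case versus the folded case. The helper names (`world_size`, `folded_edp`) and the specific degrees are made up for illustration and are not Megatron's API. It also assumes that both meshes cover the same set of ranks within each pipeline stage, i.e. DP x TP x CP = EP x ETP x EDP.

```python
from math import prod


def world_size(**degrees: int) -> int:
    """Number of ranks in a mesh: the product of all its parallelism degrees."""
    return prod(degrees.values())


def folded_edp(world: int, pp: int, ep: int, etp: int = 1) -> int:
    """Expert-data-parallel degree left over once EP and ETP are chosen.

    Assumption (hypothetical helper): the MoE mesh EP x ETP x EDP x PP
    must cover exactly the same ranks as the attention mesh.
    """
    if world % pp != 0:
        raise ValueError("world size must be divisible by PP")
    ranks_per_stage = world // pp
    if ranks_per_stage % (ep * etp) != 0:
        raise ValueError("EP * ETP must divide the ranks in a pipeline stage")
    return ranks_per_stage // (ep * etp)


# Case 1: EP as its own dimension in a single mesh.
# Even with a small DP, EP=64 multiplies everything else.
single_mesh = world_size(dp=2, tp=8, pp=4, ep=64)

# Case 2: EP reuses the DP ranks, so EP cannot exceed DP.
attention = dict(dp=16, tp=8, cp=1, pp=4)
max_ep_when_tied_to_dp = attention["dp"]

# Case 3: folded meshes. Attention and MoE layers factor the same
# ranks differently, so EP=64 fits even though DP is only 16.
world = world_size(**attention)
edp = folded_edp(world, pp=attention["pp"], ep=64, etp=1)
moe = dict(ep=64, etp=1, edp=edp, pp=attention["pp"])

print(f"single mesh with separate EP dim: {single_mesh} ranks")
print(f"EP tied to DP: EP <= {max_ep_when_tied_to_dp}")
print(f"folded: {world} ranks, MoE mesh {moe}")
```

With these illustrative numbers the separate-EP mesh needs 4096 ranks, while the folded layout reaches EP=64 on 512 ranks because the MoE layers regroup the TP and DP ranks of the attention layers into expert-parallel groups.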
