
PyTorch torchtitan pull request ports Deep-EP to 4-GPU nodes for dropless Mixture-of-Experts dispatching

The CUDA-graphable implementation is optimized for full recomputation.


Original post

Alexander Doria (@Dorialexander)

let's see if deep-ep can be ported to 4-gpu nodes.

too hot to get out, time to rewrite kernels

7:23 AM · Jun 27, 2026 · 937 Views


Posts from X


@Dorialexander Just leaving this here https://github.com/pytorch/torchtitan/pull/3561
