let's see if deep-ep can be ported to 4-gpu nodes.
Alexander Doria@Dorialexander
too hot to get out, time to rewrite kernels
7:23 AM · Jun 27, 2026 · 937 Views
The CUDA-graphable implementation is optimized for full recomputation.
let's see if deep-ep can be ported to 4-gpu nodes.
too hot to get out, time to rewrite kernels
No Digg Deeper questions have been answered for this story yet.
@Dorialexander Just leaving this here https://github.com/pytorch/torchtitan/pull/3561
let's see if deep-ep can be ported to 4-gpu nodes.