Runway Adopts DTensor to Prevent Silent Gradient Bugs in Distributed Training

Distributed training is hard. We adopted DTensor at Runway to prevent silent gradient bugs and it delivered. But we traded performance for correctness, hitting dispatch overhead, recompilation storms, and MFU drops. Wrote up what we learned and how we work around it. https://runwayml.com/news/dtensor-distributed-training

11:47 AM · May 18, 2026