@Placement
How do you usually figure out why a multi-GPU training run is slower than expected?
I've been bitten by this a few times recently and realized everyone seems to have a slightly different workflow. Thinking about the last time a multi-GPU (DDP / FSDP) training run was noticeably slower than you expected: what did you suspect first? How did you narrow it down?
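For context, my own first pass is usually just profiling a handful of steps and checking whether NCCL communication, compute, or the dataloader dominates. Roughly this kind of sketch, assuming a PyTorch DDP setup; `model`, `loader`, and `optimizer` are placeholders for whatever your training script already has, and the loss is a stand-in:

```python
# Minimal sketch: profile a few DDP training steps with torch.profiler
# and see what dominates the CUDA time table.
import torch
from torch.profiler import profile, schedule, ProfilerActivity

def train_step(model, batch, optimizer):
    optimizer.zero_grad(set_to_none=True)
    out = model(batch)
    loss = out.sum()  # placeholder loss, just to drive backward
    loss.backward()   # DDP's gradient allreduce overlaps with backward here
    optimizer.step()

def profile_steps(model, loader, optimizer, steps=8):
    total = 1 + 2 + steps  # wait + warmup + active iterations
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=2, active=steps, repeat=1),
    ) as prof:
        for i, batch in enumerate(loader):
            train_step(model, batch, optimizer)
            prof.step()  # advances the wait/warmup/active schedule
            if i + 1 == total:
                break
    # Sort by total CUDA time: if NCCL allreduce ops sit near the top,
    # the run is likely communication-bound; large idle gaps or
    # dataloader-related CPU ops point at the input pipeline instead.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```

If the allreduce kernels dominate the table I start looking at the interconnect and bucket sizes; if the GPU is mostly idle between steps I look at the dataloader first. Curious whether others start from the profiler too, or from something coarser like step-time logging per rank.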