PufferLib 4.0 multigpu is pretty good. 6.6x scaling on 8 GPUs worst-case with a 32x param policy. >100M sps peak on 8x RTX5090
4:06 PM · Jun 21, 2026 · 3.1K Views
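For context on the 6.6x figure: dividing the speedup by the GPU count gives scaling efficiency relative to ideal linear scaling. A quick sketch of that arithmetic (the helper name `scaling_efficiency` is made up here for illustration, not part of PufferLib):

```python
def scaling_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return speedup / num_gpus


# Numbers from the post: 6.6x speedup on 8 GPUs in the worst case.
eff = scaling_efficiency(6.6, 8)
print(f"{eff:.1%} of linear scaling")
```

So the worst case quoted above works out to roughly 82.5% of linear scaling.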
The weights and grads are in contiguous memory, so synchronization is a single NCCL reduce. No additional overhead for coalescing, etc. This is just a perf test; Breakout already trains in 4 seconds on 1 GPU.
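A rough illustration of why contiguous storage helps, as a toy sketch and not PufferLib's actual code: if every parameter's gradient is a view into one flat buffer, a single collective over that buffer covers all parameters with no per-tensor calls and no bucketing or copy step. Everything below is an assumption for illustration: the parameter names and sizes are invented, and `all_reduce_mean` is a pure-Python stand-in for an NCCL collective, run over simulated ranks in one process.

```python
from array import array

# Hypothetical parameter sizes (element counts); not PufferLib's real policy.
PARAM_SIZES = {"encoder.weight": 6, "encoder.bias": 2, "head.weight": 4}


def make_flat_grads(sizes):
    """Allocate one contiguous float buffer plus a named view per parameter."""
    flat = array("f", [0.0] * sum(sizes.values()))
    mem = memoryview(flat)
    views, offset = {}, 0
    for name, n in sizes.items():
        views[name] = mem[offset:offset + n]  # a view into flat, not a copy
        offset += n
    return flat, views


def all_reduce_mean(flat_buffers):
    """Toy stand-in for one collective call: average all buffers in place."""
    world = len(flat_buffers)
    for i in range(len(flat_buffers[0])):
        mean = sum(buf[i] for buf in flat_buffers) / world
        for buf in flat_buffers:
            buf[i] = mean


# Simulate 2 ranks, each writing its gradients through the per-parameter views.
ranks = [make_flat_grads(PARAM_SIZES) for _ in range(2)]
for rank, (_, views) in enumerate(ranks):
    for view in views.values():
        for i in range(len(view)):
            view[i] = float(rank + 1)  # rank 0 writes 1.0, rank 1 writes 2.0

# One call over the flat buffers synchronizes every parameter at once.
all_reduce_mean([flat for flat, _ in ranks])
print(list(ranks[0][1]["encoder.bias"]))
```

Because the views alias the flat buffer, each parameter sees the averaged gradient without any scatter step afterward. In a real multi-GPU setup the same idea would presumably map onto one collective call on a flat device tensor.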