Naive PyTorch implementations of dynamic convolutions are very slow. The main bottleneck is memory traffic from materializing and moving intermediate tensors.
We address this with custom Triton GPU kernels that perform competitively with CUDA kernels for static short convolutions.
We provide two kernels:
- General/Head-wise: shares filters across groups of channels, which shrinks the convolution weight tensor (the main I/O bottleneck) and improves latency.
- Low-rank: fuses an up-projection into the kernel and produces the dynamic convolution weights on-chip, without materializing them in HBM.
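
For intuition, here is a minimal PyTorch sketch of what the two variants compute. It is a slow reference, not the Triton kernels themselves. The tensor layouts (`(B, L, D)` inputs, `(B, L, H, K)` filters), the causal padding, and the names `dynamic_conv_reference` and `low_rank_filters` are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F


def dynamic_conv_reference(x, w):
    """Causal dynamic short convolution (slow reference, assumed layout).

    x: (B, L, D) input sequence.
    w: (B, L, H, K) per-position filters; each of the H heads is shared
       by D // H consecutive channels (H == D means no sharing).
    """
    B, L, D = x.shape
    _, _, H, K = w.shape
    assert D % H == 0, "channels must split evenly into heads"
    # Left-pad the time axis so each position sees K causal taps.
    x_pad = F.pad(x, (0, 0, K - 1, 0))             # (B, L + K - 1, D)
    # Sliding windows over time; the reshape below copies them into a
    # full (B, L, D, K) tensor, the kind of intermediate that makes
    # the naive approach memory-bound.
    windows = x_pad.unfold(1, K, 1)                # (B, L, D, K)
    windows = windows.reshape(B, L, H, D // H, K)
    y = torch.einsum("blhck,blhk->blhc", windows, w)
    return y.reshape(B, L, D)


def low_rank_filters(z, up, H, K):
    """Build filters from rank-R features via an up-projection.

    z: (B, L, R), up: (R, H * K). The result is materialized here;
    a fused kernel would keep it on-chip instead.
    """
    B, L, _ = z.shape
    return (z @ up).reshape(B, L, H, K)


# Hypothetical sizes, chosen only to exercise the shapes.
B, L, D, H, K, R = 2, 16, 8, 2, 4, 3
x = torch.randn(B, L, D)
z = torch.randn(B, L, R)
up = torch.randn(R, H * K)
w = low_rank_filters(z, up, H, K)   # (B, L, H, K)
y = dynamic_conv_reference(x, w)    # (B, L, D)
```

With head-wise sharing the filter tensor holds `B * L * H * K` elements instead of `B * L * D * K`, a reduction by a factor of `D / H`. In the low-rank case only the `(B, L, R)` features and the `(R, H * K)` projection need to be read from memory.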