/Tech3h ago

Integrating dynamic short convolutions improves Transformer performance across scales using custom Triton GPU kernels

The gains also transfer to Mixture-of-Experts and linear-attention architectures

#153

Original post

Oliver Sieberling@osieberling

New paper 🧵

We show that dynamic short convolutions consistently improve Transformers across scales. We make these gains practical with an efficient parameterization and custom Triton GPU kernels.

The improvements carry over to MoEs and linear attention variants (Mamba-2/GDN).

5:59 AM · Jun 8, 2026 · 44.8K Views

Sentiment

Users praise the Dynamic Short Convolutions paper for its efficiency gains that carry across Transformers, MoEs, and Mamba variants.

Pos 100.0% · Neg 0.0%

4 comments with sentiment.

Cluster Engagement

Posts from X

Oliver Sieberling@osieberling

A static short convolution is a depthwise-separable causal convolution with small kernel width (typically W=4). We make it dynamic by letting the filter depend on the current time step. Each token therefore selects its own local convolution filter.
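
A minimal plain-Python sketch of the static vs. dynamic distinction described above. It is a readable reference, not the Triton kernels from the thread. The function names, the `[T][D]` list layout, and the convention that tap `k = 0` multiplies the current token are assumptions made for illustration. How the per-token filter is computed from the input is not spelled out in this post, so `w_t` is simply taken as given.

```python
# Sketch only: x is a list of T tokens, each a list of D channel values.

def static_short_conv(x, w):
    """Causal depthwise conv: w[d][k] is one fixed width-W filter per channel."""
    T, D, W = len(x), len(x[0]), len(w[0])
    y = [[0.0] * D for _ in range(T)]
    for t in range(T):
        for d in range(D):
            for k in range(W):
                if t - k >= 0:  # causal: only the current and past tokens
                    y[t][d] += w[d][k] * x[t - k][d]
    return y


def dynamic_short_conv(x, w_t):
    """Same operation, but w_t[t][d][k] holds a separate filter per time step."""
    T, D, W = len(x), len(x[0]), len(w_t[0][0])
    y = [[0.0] * D for _ in range(T)]
    for t in range(T):
        for d in range(D):
            for k in range(W):
                if t - k >= 0:
                    y[t][d] += w_t[t][d][k] * x[t - k][d]
    return y


x = [[1.0], [2.0], [3.0]]                      # T=3 tokens, D=1 channel
print(static_short_conv(x, [[1.0, 0.5]]))      # [[1.0], [2.5], [4.0]]
# Each token picks its own filter: copy, shift-by-one, sum of two.
filters = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]]]
print(dynamic_short_conv(x, filters))          # [[1.0], [1.0], [5.0]]
```

Passing the same filter at every time step makes the dynamic version reduce to the static one, which is a convenient sanity check.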

Oliver Sieberling@osieberling

Triton Kernels: https://github.com/OliverSieberling/dynamic-conv1d Paper: https://arxiv.org/abs/2606.03825

Oliver Sieberling@osieberling

Naive PyTorch implementations of dynamic convolutions are veryyy slow. The main bottleneck is memory traffic from materializing/moving intermediate tensors.

We address this with custom Triton GPU kernels that perform competitively with CUDA kernels for static short convolutions.

We provide two kernels:

- General/Head-wise: shares filters across groups of channels, which shrinks the convolution weight tensor (the main I/O bottleneck) and improves latency.
- Low-rank: fuses an up-projection and produces the dynamic convolution weights on-chip, without materializing them in HBM.
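
A rough sketch of why the two variants reduce memory traffic, in plain Python. The helper names, the example sizes (`T`, `D`, `channels_per_head`), and the exact shape of the up-projection `U` are hypothetical. The thread does not give the paper's precise parameterization, so this is an illustration of the idea, not the actual kernels.

```python
def dynamic_filter_elements(T, D, W, channels_per_head=1):
    """Filter weights needed for T tokens when each group of
    `channels_per_head` channels shares one width-W filter."""
    assert D % channels_per_head == 0
    num_heads = D // channels_per_head
    return T * num_heads * W


def low_rank_filters(z_t, U):
    """Up-project a rank-r latent z_t (length r) into flattened filter
    weights. U is an assumed [r][H*W] matrix; a fused kernel would do
    this on-chip so the full filter tensor is never written to HBM."""
    r, out = len(U), len(U[0])
    return [sum(z_t[i] * U[i][j] for i in range(r)) for j in range(out)]


# Illustrative sizes only (not taken from the paper).
per_channel = dynamic_filter_elements(T=4096, D=2048, W=4)
head_wise = dynamic_filter_elements(T=4096, D=2048, W=4, channels_per_head=64)
print(per_channel, head_wise, per_channel // head_wise)  # 33554432 524288 64

print(low_rank_filters([1.0, 2.0], [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]))  # [1.0, 2.0, 4.0]
```

With these made-up sizes, sharing one filter across 64 channels cuts the per-token filter tensor by 64x. In the low-rank case only the length-r latent per token would need to be read from memory.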

Oliver Sieberling@osieberling

We motivate dynamic short convolutions by the need for context-dependent local composition in language. E.g. "the old can opener" vs. "the old can swim"

In particular, when applied to QKV, dynamic short convolutions can compose local context into multi-token keys for retrieval.

Oliver Sieberling@osieberling

If we place dynamic short convolutions after every linear layer (instead of just after the qkv-projection), the gains are even larger.

Fitting scaling laws for this variant gives an approximate 1.60x compute advantage over compute-matched transformers.

Oliver Sieberling@osieberling

We train language models of various scales (150M-2B params) and apply dynamic short convolutions to Q, K, and V before the attention. We find that this significantly improves language modeling across scales.

Fitting scaling laws suggests an approximate 1.33x compute advantage.

Oliver Sieberling@osieberling

We integrate our Triton kernels into lm-engine and measure end-to-end training throughput on an H100.

Adding dynamic convolutions on QKV is only ~7% slower, and therefore the 1.33x compute advantage translates into a significant wall-clock time advantage.
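
A back-of-the-envelope check of the wall-clock claim. It assumes the 1.33x figure and the ~7% slowdown can simply be combined, which the thread implies but does not derive. "7% slower" could mean longer step time or lower throughput, so both readings are computed. The function name is hypothetical.

```python
def wallclock_advantage(compute_advantage, slowdown):
    """Return the wall-clock advantage under two readings of 'X% slower'."""
    # Reading 1: each training step takes (1 + slowdown) times as long.
    if_step_time_up = compute_advantage / (1 + slowdown)
    # Reading 2: throughput (tokens/s) drops to (1 - slowdown) of baseline.
    if_throughput_down = compute_advantage * (1 - slowdown)
    return if_step_time_up, if_throughput_down


a, b = wallclock_advantage(1.33, 0.07)
print(round(a, 2), round(b, 2))  # 1.24 1.24
```

Under either reading the result is roughly a 1.24x wall-clock advantage, consistent with the post's claim that most of the compute gain survives the kernel overhead.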

Oliver Sieberling@osieberling

Our recipe transfers beyond standard transformers:

Modern linear RNNs already use static short convolutions on the queries, keys, and values. Replacing them with dynamic short convolutions substantially improves language modeling performance for both Mamba-2 and Gated DeltaNet.

Oliver Sieberling@osieberling

Joint work with @bharatrunwal2 @rpanda89 @yoonrkim, supported by @MITIBMLab.

Oliver Sieberling@osieberling

@Ali_NT99 not quite, the convolution kernel itself is input-dependent, so each token performs a different (learned) convolution. Through this you could potentially learn a dynamic kernel size, but the approach is more general/powerful than just this.

Ali Naeimi@Ali_NT99

@osieberling Congrats on the release! So this is basically canon_layers with dynamic kernel size right?

Lucas Beyer (bl16)@giffmana

@osieberling Nice that you included torch.compile timings!

kaio ken@kaiokendev1

@osieberling Noam'd again

Clark@clark__labs

@osieberling this is the kind of efficiency work that quietly compounds: better kernels, less wasted motion, and gains that carry across architectures. not flashy, very real.

EB1A Experts@eb1aexperts

@osieberling Impressive result. The fact that the gains carry across Transformers, MoEs, and Mamba variants makes this particularly interesting.

That AI Guy@LewisWeldtech

@osieberling

𝕎00t@wmertens

@osieberling @Ali_NT99 Does this mean you could also make a byte level model with this approach?
