
Oliver Sieberling of MIT shows dynamic short convolutions improve Transformer performance up to 2B parameters using custom Triton kernels

The architecture also improves Mixture-of-Experts and Mamba-2.


Original post

Alexia Jolicoeur-Martineau#516

Oliver Sieberling@osieberling

New paper 🧵

We show that dynamic short convolutions consistently improve Transformers across scales. We make these gains practical with an efficient parameterization and custom Triton GPU kernels.

The improvements carry over to MoEs and linear attention variants (Mamba-2/GDN).

5:59 AM · Jun 8, 2026 · 49.8K Views


Sentiment

Many users found Dynamic Short Convolutions for Transformers refreshing because the work emphasizes practical wall-clock performance with Triton kernels instead of just FLOPs.

Pos: 100.0% · Neg: 0.0% (5 comments with sentiment)

Cluster Engagement

Posts from X

Most Activity

Views: 19.8K · Bookmarks: 66 · Likes: 134 · Replies: 12

Lucas Beyer (bl16)@giffmana

This work looks interesting, basically like @ZeyuanAllenZhu Canon layer but the conv weights are input dependent.

Which sounds horrible for perf, but they focus a lot on this and include triton kernels as well as consider wall clock time in their scaling experiments.


Oliver Sieberling@osieberling

Triton Kernels: https://github.com/OliverSieberling/dynamic-conv1d
Paper: https://arxiv.org/abs/2606.03825


Han Guo@HanGuo97

Very clever + cool idea that works!


Oliver Sieberling@osieberling

A static short convolution is a depthwise-separable causal convolution with small kernel width (typically W=4). We make it dynamic by letting the filter depend on the current time step. Each token therefore selects its own local convolution filter.
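To make the static/dynamic distinction concrete, here is a minimal NumPy sketch (not the authors' Triton implementation). The function names and the linear filter generator `proj` are hypothetical, chosen only for illustration; the thread does not specify how the per-token filters are produced.

```python
import numpy as np

def static_short_conv(x, w):
    """Causal depthwise conv. x: (T, D); w: (D, W), shared by all time steps."""
    T, D = x.shape
    W = w.shape[1]
    # Left-pad with W-1 zeros so position t only sees x[t-W+1 .. t].
    xp = np.concatenate([np.zeros((W - 1, D)), x], axis=0)
    y = np.zeros((T, D))
    for k in range(W):
        # Tap k reads the input k steps in the past.
        y += xp[W - 1 - k : W - 1 - k + T] * w[:, k]
    return y

def dynamic_short_conv(x, w_dyn):
    """Same operation, but w_dyn: (T, D, W) holds one filter per time step."""
    T, D = x.shape
    W = w_dyn.shape[2]
    xp = np.concatenate([np.zeros((W - 1, D)), x], axis=0)
    y = np.zeros((T, D))
    for k in range(W):
        y += xp[W - 1 - k : W - 1 - k + T] * w_dyn[:, :, k]
    return y

rng = np.random.default_rng(0)
T, D, W = 6, 4, 4
x = rng.standard_normal((T, D))
w = rng.standard_normal((D, W))

# Assumption: filters come from a linear map of the current token.
proj = 0.1 * rng.standard_normal((D, D * W))
w_dyn = (x @ proj).reshape(T, D, W)

y_static = static_short_conv(x, w)
y_dynamic = dynamic_short_conv(x, w_dyn)
```

Broadcasting a single static filter across all time steps recovers the static convolution, so the dynamic version is a strict generalization in this sketch.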


Oliver Sieberling@osieberling

We motivate dynamic short convolutions by the need for context-dependent local composition in language. E.g. "the old can opener" vs. "the old can swim"

In particular, when applied to QKV, dynamic short convolutions can compose local context into multi-token keys for retrieval.


Oliver Sieberling@osieberling

Naive PyTorch implementations of dynamic convolutions are veryyy slow. The main bottleneck is memory traffic from materializing/moving intermediate tensors.

We address this with custom Triton GPU kernels that perform competitively with CUDA kernels for static short convolutions.

We provide two kernels:

- General/Head-wise: shares filters across groups of channels, which shrinks the convolution weight tensor (the main I/O bottleneck) and improves latency
- Low-rank: fuses an up-projection and produces the dynamic convolution weights on-chip, without materializing them in HBM.
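A rough NumPy sketch of why these two parameterizations reduce memory traffic. This is an illustration under stated assumptions, not the released kernels: the names `headwise_dynamic_conv`, `lowrank_filters`, `down`, and `up`, and the exact low-rank form, are hypothetical. The sketch also materializes tensors that a fused kernel would keep on-chip.

```python
import numpy as np

def headwise_dynamic_conv(x, w_head, group_size):
    """x: (T, D); w_head: (T, H, W), one filter per group of channels per step."""
    T, D = x.shape
    W = w_head.shape[2]
    # Expand shared filters to every channel in the group: (T, H, W) -> (T, D, W).
    # Materialized here only for clarity.
    w_full = np.repeat(w_head, group_size, axis=1)
    xp = np.concatenate([np.zeros((W - 1, D)), x], axis=0)
    y = np.zeros((T, D))
    for k in range(W):
        y += xp[W - 1 - k : W - 1 - k + T] * w_full[:, :, k]
    return y

def lowrank_filters(x, down, up, W):
    """Hypothetical generator: (T, D) -> (T, r) code -> (T, H, W) filters."""
    z = x @ down                      # compact per-token code, shape (T, r)
    return (z @ up).reshape(x.shape[0], -1, W)

T, D, W, group_size, r = 8, 16, 4, 4, 2
H = D // group_size
rng = np.random.default_rng(0)
x = rng.standard_normal((T, D))
down = rng.standard_normal((D, r))
up = rng.standard_normal((r, H * W))

w_head = lowrank_filters(x, down, up, W)
y = headwise_dynamic_conv(x, w_head, group_size)

# Number of filter-related values per sequence in each scheme.
per_channel_elems = T * D * W   # one filter per channel per step
headwise_elems = T * H * W      # one filter per head per step
lowrank_elems = T * r           # only the low-rank code, if up-projection is fused
```

With these toy sizes the per-token filter data drops from 512 values (per-channel) to 128 (head-wise) to 16 (low-rank code only), which is the kind of I/O saving the two kernels target.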


Oliver Sieberling@osieberling

If we place dynamic short convolutions after every linear layer (instead of just after the qkv-projection), the gains are even larger.

Fitting scaling laws for this variant gives an approximate 1.60x compute advantage over compute-matched transformers.


alex peysakhovich@alex_peys

adding small local convs for information pooling operators is great because the convs act like a dynamic tokenizer, this is a cool result and hope it holds up at scale!


Oliver Sieberling@osieberling

We train language models of various scales (150M-2B params) and apply dynamic short convolutions to Q, K, and V before the attention. We find that this significantly improves language modeling across scales.

Fitting scaling laws suggests an approximate 1.33x compute advantage.


Oliver Sieberling@osieberling

We integrate our Triton kernels into lm-engine and measure end-to-end training throughput on an H100.

Adding dynamic convolutions on QKV is only ~7% slower, and therefore the 1.33x compute advantage translates into a significant wall-clock time advantage.
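As a back-of-the-envelope check on that claim, the sketch below combines the two numbers from the thread. It assumes "~7% slower" means each training step takes about 1.07x as long, and that the 1.33x figure is measured at matched compute. Neither assumption is confirmed by the thread, and `wall_clock_advantage` is a hypothetical helper name.

```python
def wall_clock_advantage(compute_advantage, slowdown):
    """Estimated speedup to reach the same loss, given a per-step slowdown fraction."""
    return compute_advantage / (1.0 + slowdown)

est = wall_clock_advantage(1.33, 0.07)
print(f"{est:.2f}x")  # prints "1.24x"
```

Reading "7% slower" as 0.93x throughput instead gives 1.33 * 0.93, which is also roughly 1.24x, so the estimate is not sensitive to that choice.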


Oliver Sieberling@osieberling

Our recipe transfers beyond standard transformers:

Modern linear RNNs already use static short convolutions on the queries, keys, and values. Replacing them with dynamic short convolutions substantially improves language modeling performance for both Mamba-2 and Gated DeltaNet.


Oliver Sieberling@osieberling

Joint work with @bharatrunwal2 @rpanda89 @yoonrkim, supported by @MITIBMLab.


Oliver Sieberling@osieberling

@Ali_NT99 not quite, the convolution kernel itself is input-dependent, so each token performs a different (learned) convolution. Through this you could potentially learn a dynamic kernel size, but the approach is more general/powerful than just this.


Ali Naeimi@Ali_NT99

@osieberling Congrats on the release! So this is basically canon_layers with dynamic kernel size right?


oso@osoleve

@giffmana @ZeyuanAllenZhu This sounds really promising in the causally masked regime, from a linguistics perspective, but interestingly my intuition is it would be harmful in an RNN or at least destabilizing when it comes to sentences that require a full parse to be unambiguous


Blissy@BlissyOnX

@giffmana @ZeyuanAllenZhu input-dependent weights + triton kernels and wall clock focus

actually refreshing to see perf taken seriously in a paper like this


Alex YGift@Radipdegen

@giffmana @ZeyuanAllenZhu lowkey respect that they accounted for wall time instead of just flops. that alone separates the real from the theater


Strata@ChainZenit

@giffmana @ZeyuanAllenZhu that sounds like a massive engineering headache, how do they optimize?


Eclipse 🌖@ECLresearch

@giffmana @ZeyuanAllenZhu The input-dependent weights are a massive perf tax, but if their Triton kernels close the gap enough to beat vanilla transformers on wall-clock at scale, that’s the real signal — otherwise it’s just a theoretical curiosity.
