/Tech · 6h ago

Fused Triton Kernels Released For MaxSim Late Interaction Scoring

105911196.1K

#166

Original post

Omar Khattab#166

Tony Wu@tonywu_71

Very excited to release late-interaction-kernels (LIK): fused Triton kernels for MaxSim, the scoring step behind ColBERT, ColPali & LateOn. 🚀

Numerically equivalent to PyTorch at a fraction of the memory, with day-0 support in PyLate & colpali-engine. (1/N 🧵)

6:29 AM · Jun 10, 2026 · 4.4K Views

Sentiment

Commenters praised the release of fused Triton kernels for MaxSim scoring in ColBERT and ColPali, citing strong community collaboration, helpful PR reviews on the integrations, and the quality of the walkthrough visuals.

Pos

100.0%

Neg

0.0%

4 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 1.7K · Bookmarks 7 · Likes 15 · Retweets 1

Raphaël Sourty@raphaelsrty

Computing max similarity (scoring step of colbert, colpali) on gpus can be optimized and this is what @tonywu_71 did.

It's available in PyLate, it will accelerate both training and inference of multi-vector models

pip install "pylate[lik]"

so cool, from @tonywu_71 and @Aurelien_L_

4h1.7K157

REPLIES1

Tony Wu@tonywu_71

@PonyRoi's flash-maxsim and @ErikKaum's MaxSim kernel tackled the same problem independently: fused Triton and hand-written CUDA + Metal respectively.

Really nice work, and very cool to see the community converging on the right ideas (13/N)

6h744

Tony Wu@tonywu_71

Full repo (Apache-2.0, pip install late-interaction-kernels):

https://github.com/hcompai/late-interaction-kernels (12/N)

6h3721

Tony Wu@tonywu_71

⚡ So we don't build it. Stream doc tiles through on-chip SRAM (the GPU's fast scratchpad), keep a running max in registers, never write the grid to HBM. Same outer-product tiling as FlashAttention, softmax swapped for a plain max. (4/N)

6h444
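The streaming idea in (4/N) can be illustrated on the CPU. The following is a minimal NumPy sketch, not the LIK Triton kernel: the function names and the tile size are hypothetical, and summing the per-token maxima into one score follows the usual ColBERT convention rather than anything stated in the thread. It only shows that a running max over doc tiles reproduces the result of building the full grid.

```python
import numpy as np

def maxsim_naive(q, d):
    """Reference path: materialize the full (Lq, Ld) similarity grid, then reduce."""
    sim = q @ d.T                       # (Lq, Ld) grid, fully materialized
    return sim.max(axis=1).sum()        # max over doc tokens, sum over query tokens

def maxsim_tiled(q, d, tile=4):
    """Stream doc tokens in tiles, keeping only one running max per query token."""
    running_max = np.full(q.shape[0], -np.inf, dtype=q.dtype)
    for start in range(0, d.shape[0], tile):
        d_tile = d[start:start + tile]              # at most `tile` doc tokens
        sim_tile = q @ d_tile.T                     # the only grid slice alive at once
        running_max = np.maximum(running_max, sim_tile.max(axis=1))
    return running_max.sum()

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16)).astype(np.float32)    # 8 query tokens, dim 16
d = rng.standard_normal((37, 16)).astype(np.float32)   # 37 doc tokens (not a multiple of tile)
naive_score = maxsim_naive(q, d)
tiled_score = maxsim_tiled(q, d, tile=4)
```

Because max is associative, no rescaling step is needed when a new tile arrives, which is the simplification over softmax that the post points to.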

Tony Wu@tonywu_71

Late-interaction models represent queries and docs as sets of token embeddings. Scoring = compare every query token to every doc token, keep the max per query token. That's MaxSim.

The naive path materializes the full [Nq·Nd·Lq·Ld] grid in GPU memory, then takes the max. (2/N)

6h853
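A small sketch of the naive path described in (2/N) may help fix the shapes. It assumes NumPy, unpadded inputs, and the usual ColBERT convention of summing the per-query-token maxima into one score per (query, doc) pair. The function name is hypothetical and this is not the PyLate or colpali-engine implementation.

```python
import numpy as np

def maxsim_scores_naive(Q, D):
    """Q: (Nq, Lq, dim) query token embeddings, D: (Nd, Ld, dim) doc token embeddings.

    Returns the (Nq, Nd) score matrix and the shape of the intermediate grid.
    """
    # Every query token against every doc token: the full [Nq, Nd, Lq, Ld] grid.
    grid = np.einsum("qid,njd->qnij", Q, D)
    # Max over doc tokens (last axis), then sum over query tokens.
    scores = grid.max(axis=-1).sum(axis=-1)
    return scores, grid.shape

# One query with two tokens, two docs with two tokens each, dim 2.
Q = np.array([[[1.0, 0.0], [0.0, 1.0]]])
D = np.array([
    [[1.0, 0.0], [0.0, 1.0]],     # doc 0 matches both query tokens exactly
    [[0.5, 0.0], [0.0, 0.25]],    # doc 1 matches them only partially
])
scores, grid_shape = maxsim_scores_naive(Q, D)
print(scores, grid_shape)
```

The intermediate `grid` is the tensor whose size the rest of the thread is about: it exists only to be reduced away on the next line.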

Tony Wu@tonywu_71

LIK started as a side project with my friend and brilliant colleague @Aurelien_L_. It was so much fun to build!

Glad to be part of the community effort to push late-interaction costs toward zero 🙌🏼

(14/N)

6h593

Tony Wu@tonywu_71

Tsm @raphaelsrty & @antoine_chaffin (PyLate) and @MaceQuent1 & @ManuelFaysse (colpali-engine) for helping with the PR reviews for the integration! (10/N)

6h403

Tony Wu@tonywu_71

Both late-interaction training libraries already come with day-0️⃣ LIK support in their latest releases:

pip install "pylate[lik]"
pip install "colpali-engine[lik]"

No training-loop changes. Runs on CUDA sm_75+ (Turing/Ampere/Hopper) and Apple Silicon (MPS). (9/N)

6h383

Tony Wu@tonywu_71

📈 Full ColQwen2 (colqwen2-base) and PyLate (GTE-ModernColBERT-v1) fine-tunes land on vanilla loss curves step for step. Same accuracy. Freed memory → larger batch sizes on the same GPU.

Full benchmark results: https://github.com/hcompai/late-interaction-kernels/blob/main/docs/benchmarks.md (8/N)

6h363

Tony Wu@tonywu_71

The grid is ~0.5 GB per 1k docs at ColPali scale (128 query tokens × 1024 page patches). In contrastive training (Nq = Nd = B) it grows as B², preventing larger batch sizes. (3/N)

6h722
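The figures in (3/N) can be checked with a few lines of arithmetic. The post does not state the element size or how many queries the "per 1k docs" figure covers. The sketch below assumes 4-byte (fp32) entries and a single query, which is one reading that reproduces roughly 0.5 GB. The helper name is hypothetical.

```python
def grid_bytes(n_queries, n_docs, lq=128, ld=1024, bytes_per_elem=4):
    """Size of the materialized [Nq, Nd, Lq, Ld] similarity grid in bytes.

    lq and ld are the ColPali-scale token counts quoted in the post;
    bytes_per_elem=4 (fp32) is an assumption, not stated in the thread.
    """
    return n_queries * n_docs * lq * ld * bytes_per_elem

# One query scored against 1k docs.
per_1k_docs_gb = grid_bytes(1, 1000) / 1e9
print(f"1 query x 1k docs: {per_1k_docs_gb:.3f} GB")

# Contrastive training, Nq = Nd = B: doubling B quadruples the grid.
for b in (32, 64, 128):
    print(f"B={b:4d}: {grid_bytes(b, b) / 2**30:.1f} GiB")
```

Under these assumptions the grid alone reaches 8 GiB at B = 128, which is why the quadratic term, rather than the model, ends up capping the batch size.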

Tony Wu@tonywu_71

Mathematically identical to naive. No approximation and numerically equivalent to PyTorch (fp32 accumulators, parity tested). Backward is also fused. (5/N)

6h422
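The thread says the backward pass is fused but does not describe how. As background only, here is the standard gradient of a max-then-sum reduction for a single (query, doc) pair, written in NumPy with hypothetical function names. Each query token receives the embedding of its winning doc token, and each doc token accumulates the query tokens it won. This is a sketch of the math, not of LIK's kernel.

```python
import numpy as np

def maxsim_forward(q, d):
    """q: (Lq, dim), d: (Ld, dim). Returns the score and the winning doc index per query token."""
    sim = q @ d.T
    best = sim.argmax(axis=1)                        # j*(i) for each query token i
    score = sim[np.arange(q.shape[0]), best].sum()
    return score, best

def maxsim_backward(q, d, best):
    """Gradients of the score with respect to q and d, given the saved argmax indices."""
    grad_q = d[best]                                 # dS/dq_i = d_{j*(i)}
    grad_d = np.zeros_like(d)
    np.add.at(grad_d, best, q)                       # dS/dd_j = sum of q_i with j*(i) == j
    return grad_q, grad_d

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
d = rng.standard_normal((6, 8))
score, best = maxsim_forward(q, d)
grad_q, grad_d = maxsim_backward(q, d, best)
```

Only the argmax indices need to survive from the forward pass, so the gradient can be formed without the full similarity grid ever being stored.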

Tony Wu@tonywu_71

⏩ Not writing the grid is faster too, at matched numerics:

• 1.7–16× on reranking / inference (longer Ld → higher end)
• 5.0–6.9× on MaxSim (fwd + bwd) in PyLate's cached-contrastive loss
• long-context (Ld ≥ 8k): shapes the naive path can't run

(7/N)

6h392

Tony Wu@tonywu_71

💾 In real ColQwen2 training with colpali-engine (H100, LoRA + grad-checkpointing, real colpali_train_set pages), the MaxSim op drops from 7.8 GiB to 61 MiB at batch 128. About 130×, same GPU. The fused kernel's 61 MiB fits in the memory scraps vanilla can't allocate. (6/N)

6h392
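The "about 130×" in (6/N) follows directly from the two quoted figures once both are put in the same unit. A quick check, using only the numbers from the post:

```python
naive_mib = 7.8 * 1024      # 7.8 GiB expressed in MiB
fused_mib = 61.0            # fused kernel footprint in MiB
ratio = naive_mib / fused_mib
print(f"{naive_mib:.1f} MiB / {fused_mib:.0f} MiB = {ratio:.1f}x")
```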

Tony Wu@tonywu_71

👀 Full walkthrough — the tiling, the online max, the backward pass, step-through animations and benchmark plots:

https://hcompai.github.io/late-interaction-kernels/how-it-works.html (11/N)

6h302

Erik Kaunismäki@ErikKaum

@tonywu_71 @Aurelien_L_ Awesome work guys 🙌 And those visuals are just so nice 😍

5h261

Lunari@0x_lun

@tonywu_71 @Aurelien_L_ fusing MaxSim was always the obvious bottleneck but nobody had clean triton kernels for it until now

curious how the memory savings hold up at longer sequence lengths with dense token sets

5h24

Antoine Chaffin@antoine_chaffin

@tonywu_71 @Aurelien_L_ late interaction is getting commoditized also, if I rebase my branch, it means you can train ColPali models with LIK... ;)

5h16

Antoine E.@antoine_edy

@tonywu_71 @Aurelien_L_ The design walkthrough webpage is sooo good 🫶

4h5