Fused Triton Kernels Released For MaxSim Late Interaction Scoring

Tony Wu@tonywu_71

Very excited to release late-interaction-kernels (LIK): fused Triton kernels for MaxSim, the scoring step behind ColBERT, ColPali & LateOn. 🚀

Numerically equivalent to PyTorch at a fraction of the memory, with day-0 support in PyLate & colpali-engine. (1/N 🧵)

6:29 AM · Jun 10, 2026 · 4.4K Views
Sentiment

Users praised the release of fused Triton kernels for MaxSim scoring in ColBERT and ColPali, citing strong community collaboration, helpful PR reviews, and impressive visuals from the contributors.

Positive: 100.0% · Negative: 0.0% (4 comments with sentiment)
Raphaël Sourty@raphaelsrty

Computing max similarity (the scoring step of ColBERT and ColPali) on GPUs can be optimized, and this is what @tonywu_71 did.

It's available in PyLate, it will accelerate both training and inference of multi-vector models

pip install "pylate[lik]"

so cool, from @tonywu_71 and @Aurelien_L_

4h · Views 1.7K · Likes 15 · Bookmarks 7
Tony Wu@tonywu_71

@PonyRoi's flash-maxsim and @ErikKaum's MaxSim kernel tackled the same problem independently: fused Triton and hand-written CUDA + Metal respectively.

Really nice work, and very cool to see the community converging on the right ideas (13/N)

6h · Views 74 · Likes 4
Tony Wu@tonywu_71

Full repo (Apache-2.0, pip install late-interaction-kernels):

https://github.com/hcompai/late-interaction-kernels (12/N)

6h · Views 37 · Likes 2 · Bookmarks 1
Tony Wu@tonywu_71

⚡ So we don't build it. Stream doc tiles through on-chip SRAM (the GPU's fast scratchpad), keep a running max in registers, never write the grid to HBM. Same outer-product tiling as FlashAttention, softmax swapped for a plain max. (4/N)

6h · Views 44 · Likes 4
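
The streaming idea in post (4/N) can be illustrated without a GPU. The sketch below is plain Python, not the LIK Triton kernel: the function names, the `tile` parameter, and the toy vectors are all hypothetical, and the final sum over query tokens is the standard ColBERT convention rather than something stated in the thread. It only shows that visiting doc tokens tile by tile while keeping one running max per query token gives the same score as building the whole similarity grid first.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_naive(query, doc):
    """Reference path: build the full Lq x Ld grid, then reduce."""
    grid = [[dot(q, d) for d in doc] for q in query]
    return sum(max(row) for row in grid)


def maxsim_streaming(query, doc, tile=2):
    """Visit doc tokens in tiles; keep only one running max per query token."""
    running_max = [float("-inf")] * len(query)
    for start in range(0, len(doc), tile):
        doc_tile = doc[start:start + tile]  # stands in for a tile held in SRAM
        for i, q in enumerate(query):
            tile_max = max(dot(q, d) for d in doc_tile)
            if tile_max > running_max[i]:
                running_max[i] = tile_max  # the per-tile similarities are discarded
    return sum(running_max)


# Toy inputs (hypothetical): 2 query tokens, 3 doc tokens, dimension 2.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

naive_score = maxsim_naive(query, doc)
streamed_scores = [maxsim_streaming(query, doc, tile=t) for t in (1, 2, 3, 5)]
```

Because max is exact and order-independent, the tile size changes only how much is held at once, not the result.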
Tony Wu@tonywu_71

Late-interaction models represent queries and docs as sets of token embeddings. Scoring = compare every query token to every doc token, keep the max per query token. That's MaxSim.

The naive path materializes the full [Nq·Nd·Lq·Ld] grid in GPU memory, then takes the max. (2/N)

6h · Views 85 · Likes 3
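
As a concrete reading of post (2/N), here is a minimal sketch of the naive path in plain Python. It is an illustration under stated assumptions, not code from LIK, PyLate, or colpali-engine: the function names and toy vectors are hypothetical, similarity is taken to be a dot product, and the per-query-token maxima are summed, which is the usual ColBERT formulation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_naive(query, doc):
    """Score one query (Lq token vectors) against one doc (Ld token vectors)."""
    # Every query token vs. every doc token: the full Lq x Ld grid is stored.
    grid = [[dot(q, d) for d in doc] for q in query]
    # Keep the max per query token, then sum over query tokens.
    return sum(max(row) for row in grid)


def score_matrix(queries, docs):
    """All-pairs scores: conceptually an Nq x Nd x Lq x Ld grid of similarities."""
    return [[maxsim_naive(q, d) for d in docs] for q in queries]


# Toy inputs (hypothetical): 2 query tokens, 3 doc tokens, dimension 2.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

score = maxsim_naive(query, doc)       # maxima are 1.0 and 2.0
scores = score_matrix([query], [doc, doc])
```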
Tony Wu@tonywu_71

LIK started as a side project with my friend and brilliant colleague @Aurelien_L_. It was so much fun to build!

Glad to be part of the community effort to push late-interaction costs toward zero 🙌🏼

(14/N)

6h · Views 59 · Likes 3
Tony Wu@tonywu_71

Thanks so much to @raphaelsrty & @antoine_chaffin (PyLate) and @MaceQuent1 & @ManuelFaysse (colpali-engine) for helping with the PR reviews for the integration! (10/N)

6h · Views 40 · Likes 3
Tony Wu@tonywu_71

Both late-interaction training libraries already come with day-0️⃣ LIK support in their latest releases:

pip install "pylate[lik]"
pip install "colpali-engine[lik]"

No training-loop changes. Runs on CUDA sm_75+ (Turing/Ampere/Hopper) and Apple Silicon (MPS). (9/N)

6h · Views 38 · Likes 3
Tony Wu@tonywu_71

📈 Full ColQwen2 (colqwen2-base) and PyLate (GTE-ModernColBERT-v1) fine-tunes land on vanilla loss curves step for step. Same accuracy. Freed memory → larger batch sizes on the same GPU.

Full benchmark results: https://github.com/hcompai/late-interaction-kernels/blob/main/docs/benchmarks.md (8/N)

6h · Views 36 · Likes 3
Tony Wu@tonywu_71

The grid is ~0.5 GB per 1k docs at ColPali scale (128 query tokens × 1024 page patches). In contrastive training (Nq = Nd = B) it grows as B², preventing larger batch sizes. (3/N)

6h · Views 72 · Likes 2
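
The figures in post (3/N) can be checked with a few lines of arithmetic. This sketch assumes 4 bytes per grid entry (fp32), which the post does not state; the helper name `grid_bytes` is hypothetical.

```python
BYTES_PER_ENTRY = 4  # assumption: similarities stored as fp32


def grid_bytes(n_queries, n_docs, q_len, d_len, bytes_per_entry=BYTES_PER_ENTRY):
    """Size of the materialized [Nq, Nd, Lq, Ld] similarity grid in bytes."""
    return n_queries * n_docs * q_len * d_len * bytes_per_entry


# One query against 1k docs at the quoted ColPali scale (128 x 1024).
per_1k_docs = grid_bytes(1, 1000, 128, 1024)
per_1k_docs_gb = per_1k_docs / 1e9  # about 0.52 GB, in line with "~0.5 GB"

# Contrastive training: Nq = Nd = B, so doubling B quadruples the grid.
batch_64 = grid_bytes(64, 64, 128, 1024)
batch_128 = grid_bytes(128, 128, 128, 1024)
growth = batch_128 / batch_64
```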
Tony Wu@tonywu_71

Mathematically identical to naive. No approximation and numerically equivalent to PyTorch (fp32 accumulators, parity tested). Backward is also fused. (5/N)

6h · Views 42 · Likes 2
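
Post (5/N) says the backward pass is fused but does not describe it. As background only, the sketch below shows the textbook gradient of MaxSim with respect to the query tokens: each query token receives the embedding of its best-matching doc token. This is an assumption about the underlying math, not a description of how LIK implements it, and all names and toy vectors are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_with_query_grad(query, doc):
    """Return the MaxSim score and d(score)/d(query) for one query-doc pair."""
    score = 0.0
    grad = []
    for q in query:
        sims = [dot(q, d) for d in doc]
        best = max(range(len(doc)), key=sims.__getitem__)  # argmax doc token
        score += sims[best]
        grad.append(list(doc[best]))  # only the winning doc token contributes
    return score, grad


# Toy inputs (hypothetical): 2 query tokens, 3 doc tokens, dimension 2.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
score, grad = maxsim_with_query_grad(query, doc)

# Finite-difference check on the first coordinate of the first query token.
eps = 1e-6
bumped = [[query[0][0] + eps, query[0][1]], query[1]]
bumped_score, _ = maxsim_with_query_grad(bumped, doc)
numeric_grad = (bumped_score - score) / eps
```

Storing only the argmax index per query token is enough to run this backward step, so the full grid is not needed for gradients either.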
Tony Wu@tonywu_71

⏩ Not writing the grid is faster too, at matched numerics:

• 1.7–16× on reranking / inference (longer Ld → higher end)
• 5.0–6.9× on MaxSim (fwd + bwd) in PyLate's cached-contrastive loss
• long-context (Ld ≥ 8k): shapes the naive path can't run

(7/N)

6h · Views 39 · Likes 2
Tony Wu@tonywu_71

💾 In real ColQwen2 training with colpali-engine (H100, LoRA + grad-checkpointing, real colpali_train_set pages), the MaxSim op drops from 7.8 GiB to 61 MiB at batch 128. About 130×, same GPU. The fused kernel's 61 MiB fits in the memory scraps vanilla can't allocate. (6/N)

6h · Views 39 · Likes 2
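
The "about 130×" in post (6/N) follows directly from the two quoted sizes once both are expressed in MiB. A quick check, using only the numbers from the post:

```python
MIB_PER_GIB = 1024

vanilla_mib = 7.8 * MIB_PER_GIB  # 7.8 GiB expressed in MiB
fused_mib = 61.0                 # fused kernel footprint quoted in the post

reduction = vanilla_mib / fused_mib  # roughly 131x
```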
Tony Wu@tonywu_71

👀 Full walkthrough — the tiling, the online max, the backward pass, step-through animations and benchmark plots:

https://hcompai.github.io/late-interaction-kernels/how-it-works.html (11/N)

6h · Views 30 · Likes 2

@tonywu_71 @Aurelien_L_ Awesome work guys 🙌 And those visuals are just so nice 😍

5h · Views 26 · Likes 1
Lunari@0x_lun

@tonywu_71 @Aurelien_L_ fusing MaxSim was always the obvious bottleneck but nobody had clean triton kernels for it until now

curious how the memory savings hold up at longer sequence lengths with dense token sets

5h · Views 24
Antoine Chaffin@antoine_chaffin

@tonywu_71 @Aurelien_L_ Late interaction is getting commoditized. Also, if I rebase my branch, it means you can train ColPali models with LIK... ;)

5h · Views 16
Antoine E.@antoine_edy

@tonywu_71 @Aurelien_L_ The design walkthrough webpage is sooo good 🫶

4h · Views 5