Fused Triton Kernels Released For MaxSim Late Interaction Scoring

Tony Wu@tonywu_71

Very excited to release late-interaction-kernels (LIK): fused Triton kernels for MaxSim, the scoring step behind ColBERT, ColPali & LateOn. 🚀

Numerically equivalent to PyTorch at a fraction of the memory, with day-0 support in PyLate & colpali-engine. (1/N 🧵)

6:29 AM · Jun 10, 2026 · 5.1K Views
Sentiment

Users praised the fused Triton kernels release for MaxSim scoring in ColBERT and ColPali because of strong community collaboration, helpful PR reviews, and impressive visuals from the contributors.

Positive 100.0% · Negative 0.0% · 4 comments with sentiment.
Posts from X
Raphaël Sourty@raphaelsrty

Computing max similarity (the scoring step of ColBERT and ColPali) on GPUs can be optimized, and this is what @tonywu_71 did.

It's available in PyLate and will accelerate both training and inference of multi-vector models:

pip install "pylate[lik]"

so cool, from @tonywu_71 and @Aurelien_L_

8h · Views 2K · Likes 15 · Bookmarks 7
Replies 1
Tony Wu@tonywu_71

@PonyRoi's flash-maxsim and @ErikKaum's MaxSim kernel tackled the same problem independently: fused Triton and hand-written CUDA + Metal respectively.

Really nice work, and very cool to see the community converging on the right ideas (13/N)

9h · Views 74 · Likes 4
Tony Wu@tonywu_71

Full repo (Apache-2.0, pip install late-interaction-kernels):

https://github.com/hcompai/late-interaction-kernels (12/N)

9h · Views 37 · Likes 2 · Bookmarks 1
Tony Wu@tonywu_71

⚡ So we don't build it. Stream doc tiles through on-chip SRAM (the GPU's fast scratchpad), keep a running max in registers, never write the grid to HBM. Same outer-product tiling as FlashAttention, softmax swapped for a plain max. (4/N)

9h · Views 44 · Likes 4
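
The streaming idea in (4/N) can be sketched in plain NumPy. This is not the Triton kernel: the helper name `streaming_maxsim`, the tile size, and the tensor shapes are hypothetical, and a Python loop stands in for the doc tiles, so nothing here controls SRAM or registers. It only shows that folding one doc tile at a time into a running max gives the same scores as taking the max over the full grid.

```python
import numpy as np

def streaming_maxsim(Q, D, tile=2):
    """Q: [Nq, Lq, dim], D: [Nd, Ld, dim] -> [Nq, Nd] scores, one doc tile at a time."""
    Nq, Lq, _ = Q.shape
    Nd, Ld, _ = D.shape
    # Running max per (query, doc, query token); stands in for the kernel's registers.
    running = np.full((Nq, Nd, Lq), -np.inf, dtype=np.float32)
    for start in range(0, Ld, tile):
        d_tile = D[:, start:start + tile]                 # [Nd, <=tile, dim]
        sims = np.einsum("qid,njd->qnij", Q, d_tile)      # only [Nq, Nd, Lq, <=tile]
        running = np.maximum(running, sims.max(axis=-1))  # fold this tile into the max
    # Sum over query tokens (the usual ColBERT convention, assumed here).
    return running.sum(axis=-1)

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4, 8)).astype(np.float32)
D = rng.standard_normal((3, 5, 8)).astype(np.float32)

# Reference: build the full [Nq, Nd, Lq, Ld] grid, then reduce.
full = np.einsum("qid,njd->qnij", Q, D).max(axis=-1).sum(axis=-1)
tiled = streaming_maxsim(Q, D, tile=2)
print(np.allclose(tiled, full, atol=1e-4))
```

The largest intermediate in the loop is one tile wide instead of Ld wide, which is the property the fused kernel exploits on-chip.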
Tony Wu@tonywu_71

Late-interaction models represent queries and docs as sets of token embeddings. Scoring = compare every query token to every doc token, keep the max per query token. That's MaxSim.

The naive path materializes the full [Nq·Nd·Lq·Ld] grid in GPU memory, then takes the max. (2/N)

9h · Views 85 · Likes 3
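
As a rough illustration of the naive path described in (2/N), here is a minimal NumPy sketch. It is not code from LIK, PyLate, or colpali-engine: the function name `naive_maxsim` and the shapes are hypothetical, the final sum over query tokens is the usual ColBERT convention rather than something the thread spells out, and padding masks are left out.

```python
import numpy as np

def naive_maxsim(Q, D):
    """Q: [Nq, Lq, dim] query token embeddings, D: [Nd, Ld, dim] doc token embeddings.

    Returns an [Nq, Nd] score matrix.
    """
    # Full similarity grid: every query token against every doc token.
    grid = np.einsum("qid,njd->qnij", Q, D)   # [Nq, Nd, Lq, Ld], held in memory at once
    # Keep the best doc token per query token, then sum over query tokens.
    return grid.max(axis=-1).sum(axis=-1)

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4, 8)).astype(np.float32)   # 2 queries, 4 tokens each
D = rng.standard_normal((3, 5, 8)).astype(np.float32)   # 3 docs, 5 tokens each
scores = naive_maxsim(Q, D)
print(scores.shape)
```

The `grid` array is the object whose size scales as Nq·Nd·Lq·Ld; everything after it is a cheap reduction.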
Tony Wu@tonywu_71

LIK started as a side project with my friend and brilliant colleague @Aurelien_L_. It was so much fun to build!

Glad to be part of the community effort to push late-interaction costs toward zero 🙌🏼

(14/N)

9h · Views 59 · Likes 3
Tony Wu@tonywu_71

Thanks so much to @raphaelsrty & @antoine_chaffin (PyLate) and @MaceQuent1 & @ManuelFaysse (colpali-engine) for helping with the PR reviews for the integration! (10/N)

9h · Views 40 · Likes 3
Tony Wu@tonywu_71

Both late-interaction training libraries already come with day-0️⃣ LIK support in their latest releases:

pip install "pylate[lik]"
pip install "colpali-engine[lik]"

No training-loop changes. Runs on CUDA sm_75+ (Turing/Ampere/Hopper) and Apple Silicon (MPS). (9/N)

9h · Views 38 · Likes 3
Tony Wu@tonywu_71

📈 Full ColQwen2 (colqwen2-base) and PyLate (GTE-ModernColBERT-v1) fine-tunes land on vanilla loss curves step for step. Same accuracy. Freed memory → larger batch sizes on the same GPU.

Full benchmark results: https://github.com/hcompai/late-interaction-kernels/blob/main/docs/benchmarks.md (8/N)

9h · Views 36 · Likes 3
Tony Wu@tonywu_71

The grid is ~0.5 GB per 1k docs at ColPali scale (128 query tokens × 1024 page patches). In contrastive training (Nq = Nd = B) it grows as B², preventing larger batch sizes. (3/N)

9h · Views 72 · Likes 2
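
The arithmetic behind (3/N) can be reproduced with a few lines of Python. Two assumptions are baked in and are not stated in the thread: the grid is stored in fp32 (4 bytes per element), and "per 1k docs" means one query scored against 1,000 pages. The helper name `grid_bytes` is hypothetical.

```python
def grid_bytes(n_queries, n_docs, lq, ld, bytes_per_elem=4):
    """Size of the full [Nq, Nd, Lq, Ld] similarity grid (fp32 assumed by default)."""
    return n_queries * n_docs * lq * ld * bytes_per_elem

# One query against 1,000 pages at ColPali scale (128 query tokens x 1024 patches).
per_1k_docs = grid_bytes(1, 1000, 128, 1024)
print(f"{per_1k_docs / 1e9:.2f} GB")  # about 0.52 GB

# Contrastive training: Nq = Nd = B, so the grid grows as B squared.
for batch in (32, 64, 128):
    gib = grid_bytes(batch, batch, 128, 1024) / 2**30
    print(f"B={batch}: {gib:.1f} GiB")
```

Under these assumptions, doubling the batch size quadruples the grid, which is why it limits batch size in contrastive training.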
Tony Wu@tonywu_71

Mathematically identical to naive. No approximation and numerically equivalent to PyTorch (fp32 accumulators, parity tested). Backward is also fused. (5/N)

9h · Views 42 · Likes 2
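
The thread does not say how the fused backward pass in (5/N) works. One standard way to differentiate a max is to send the gradient only through the winning doc token for each query token; the NumPy sketch below assumes that approach for a single query/doc pair and checks it against a finite difference. The names `maxsim_score` and `maxsim_grads` are hypothetical, and unlike a fused kernel this version builds the full similarity matrix for clarity.

```python
import numpy as np

def maxsim_score(q, d):
    """q: [Lq, dim], d: [Ld, dim] -> scalar MaxSim score for one query/doc pair."""
    return (q @ d.T).max(axis=1).sum()

def maxsim_grads(q, d):
    """Analytic gradients of the score, using only the argmax index per query token."""
    winners = (q @ d.T).argmax(axis=1)   # [Lq] index of the best doc token
    grad_q = d[winners]                  # d(score)/d(q_i) is the winning doc token
    grad_d = np.zeros_like(d)
    np.add.at(grad_d, winners, q)        # each q_i flows back to its winning doc token
    return grad_q, grad_d

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
d = rng.standard_normal((5, 8))
grad_q, grad_d = maxsim_grads(q, d)

# Central finite difference on one entry of q as a sanity check.
eps = 1e-6
q_plus, q_minus = q.copy(), q.copy()
q_plus[0, 0] += eps
q_minus[0, 0] -= eps
numeric = (maxsim_score(q_plus, d) - maxsim_score(q_minus, d)) / (2 * eps)
assert abs(numeric - grad_q[0, 0]) < 1e-4
```

Only one index per query token is needed to route the gradient, so in principle the backward pass does not need the full grid either.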
Tony Wu@tonywu_71

⏩ Not writing the grid is faster too, at matched numerics:

• 1.7–16× on reranking / inference (longer Ld → higher end)
• 5.0–6.9× on MaxSim (fwd + bwd) in PyLate's cached-contrastive loss
• long-context (Ld ≥ 8k): shapes the naive path can't run

(7/N)

9h · Views 39 · Likes 2
Tony Wu@tonywu_71

💾 In real ColQwen2 training with colpali-engine (H100, LoRA + grad-checkpointing, real colpali_train_set pages), the MaxSim op drops from 7.8 GiB to 61 MiB at batch 128. About 130×, same GPU. The fused kernel's 61 MiB fits in the memory scraps vanilla can't allocate. (6/N)

9h · Views 39 · Likes 2
Tony Wu@tonywu_71

👀 Full walkthrough — the tiling, the online max, the backward pass, step-through animations and benchmark plots:

https://hcompai.github.io/late-interaction-kernels/how-it-works.html (11/N)

9h · Views 30 · Likes 2

@tonywu_71 @Aurelien_L_ Awesome work guys 🙌 And those visuals are just so nice 😍

9h · Views 26 · Likes 1
Lunari@0x_lun

@tonywu_71 @Aurelien_L_ fusing MaxSim was always the obvious bottleneck but nobody had clean triton kernels for it until now

curious how the memory savings hold up at longer sequence lengths with dense token sets

9h · Views 24
Antoine Chaffin@antoine_chaffin

@tonywu_71 @Aurelien_L_ Late interaction is getting commoditized. Also, if I rebase my branch, it means you can train ColPali models with LIK... ;)

8h · Views 16
Antoine E.@antoine_edy

@tonywu_71 @Aurelien_L_ The design walkthrough webpage is sooo good 🫶

8h · Views 5