/Tech9h ago

Fused Triton Kernels Released For MaxSim Late Interaction Scoring

106212207K

#166

Original post

Omar Khattab#166

Tony Wu@tonywu_71

Very excited to release late-interaction-kernels (LIK): fused Triton kernels for MaxSim, the scoring step behind ColBERT, ColPali & LateOn. 🚀

Numerically equivalent to PyTorch at a fraction of the memory, with day-0 support in PyLate & colpali-engine. (1/N 🧵)

6:29 AM · Jun 10, 2026 · 5.1K Views

Sentiment

Users praised the fused Triton kernels release for MaxSim scoring in ColBERT and ColPali because of strong community collaboration, helpful PR reviews, and impressive visuals from the contributors.

Pos

100.0%

Neg

0.0%

4 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 2K · Bookmarks: 7 · Likes: 15 · Retweets: 1

Raphaël Sourty@raphaelsrty

Computing max similarity (the scoring step of ColBERT and ColPali) on GPUs can be optimized, and this is what @tonywu_71 did.

It's available in PyLate and will accelerate both training and inference of multi-vector models.

pip install "pylate[lik]"

so cool, from @tonywu_71 and @Aurelien_L_

Tony Wu@tonywu_71

Very excited to release late-interaction-kernels (LIK): fused Triton kernels for MaxSim, the scoring step behind ColBERT, ColPali & LateOn. 🚀

Numerically equivalent to PyTorch at a fraction of the memory, with day-0 support in PyLate & colpali-engine. (1/N 🧵)

8h2K157

REPLIES1

Tony Wu@tonywu_71

@PonyRoi's flash-maxsim and @ErikKaum's MaxSim kernel tackled the same problem independently: fused Triton and hand-written CUDA + Metal respectively.

Really nice work, and very cool to see the community converging on the right ideas (13/N)

9h744

Tony Wu@tonywu_71

Full repo (Apache-2.0, pip install late-interaction-kernels):

https://github.com/hcompai/late-interaction-kernels (12/N)

9h3721

Tony Wu@tonywu_71

⚡ So we don't build it. Stream doc tiles through on-chip SRAM (the GPU's fast scratchpad), keep a running max in registers, never write the grid to HBM. Same outer-product tiling as FlashAttention, softmax swapped for a plain max. (4/N)

9h444
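The streaming scheme described in (4/N) can be imitated on the CPU to see why the full grid is never needed. The block below is a minimal NumPy sketch, not the Triton kernel: the name `maxsim_streamed` and the `tile` parameter are hypothetical, and the final sum over query tokens is the usual ColBERT aggregation, assumed here because the thread only describes the max step. In a real fused kernel the running max would live in registers per program block rather than in one large array.

```python
import numpy as np

def maxsim_streamed(Q, D, tile=64):
    """Q: [Nq, Lq, dim] query tokens, D: [Nd, Ld, dim] doc tokens -> scores [Nq, Nd]."""
    Nq, Lq, _ = Q.shape
    Nd, Ld, _ = D.shape
    # Running max per (query, doc, query token); starts at -inf so any score replaces it.
    running_max = np.full((Nq, Nd, Lq), -np.inf)
    for start in range(0, Ld, tile):
        D_tile = D[:, start:start + tile]              # at most `tile` doc tokens
        sims = np.einsum("qid,pjd->qpij", Q, D_tile)   # [Nq, Nd, Lq, <=tile] only
        running_max = np.maximum(running_max, sims.max(axis=-1))
    # Assumed aggregation: sum the per-query-token maxima.
    return running_max.sum(axis=-1)

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 3, 8))
D = rng.standard_normal((4, 10, 8))
scores = maxsim_streamed(Q, D, tile=4)  # Ld=10 with tile=4 exercises a ragged last tile
```

Because max is exact under any tiling, the streamed result matches the full-grid result up to floating-point rounding in the dot products.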

Tony Wu@tonywu_71

Late-interaction models represent queries and docs as sets of token embeddings. Scoring = compare every query token to every doc token, keep the max per query token. That's MaxSim.

The naive path materializes the full [Nq·Nd·Lq·Ld] grid in GPU memory, then takes the max. (2/N)

9h853
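For reference, the naive path from (2/N) can be sketched in a few lines of NumPy. This is an illustration under assumptions: the function name `maxsim_naive` is hypothetical, padding masks are ignored, and the sum over query tokens is the standard ColBERT aggregation rather than something stated in the thread. Here Nq and Nd are the numbers of queries and documents, and Lq and Ld are their token counts.

```python
import numpy as np

def maxsim_naive(Q, D):
    """Q: [Nq, Lq, dim] query tokens, D: [Nd, Ld, dim] doc tokens -> scores [Nq, Nd]."""
    # Materialize every query-token x doc-token similarity at once.
    grid = np.einsum("qid,pjd->qpij", Q, D)    # [Nq, Nd, Lq, Ld]
    # Keep the best doc token per query token, then sum (assumed aggregation).
    return grid.max(axis=-1).sum(axis=-1)

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 3, 4))   # 2 queries, 3 tokens each
D = rng.standard_normal((5, 6, 4))   # 5 docs, 6 tokens each
scores = maxsim_naive(Q, D)
```

The `grid` array is the intermediate that grows with Nq·Nd·Lq·Ld and that the fused kernel avoids writing to memory.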

Tony Wu@tonywu_71

LIK started as a side project with my friend and brilliant colleague @Aurelien_L_. It was so much fun to build!

Glad to be part of the community effort to push late-interaction costs toward zero 🙌🏼

(14/N)

9h593

Tony Wu@tonywu_71

Thanks so much to @raphaelsrty & @antoine_chaffin (PyLate) and @MaceQuent1 & @ManuelFaysse (colpali-engine) for helping with the PR reviews for the integration! (10/N)

9h403

Tony Wu@tonywu_71

Both late-interaction training libraries already come with day-0 LIK support in their latest releases:

pip install "pylate[lik]"
pip install "colpali-engine[lik]"

No training-loop changes. Runs on CUDA sm_75+ (Turing/Ampere/Hopper) and Apple Silicon (MPS). (9/N)

9h383

Tony Wu@tonywu_71

📈 Full ColQwen2 (colqwen2-base) and PyLate (GTE-ModernColBERT-v1) fine-tunes land on vanilla loss curves step for step. Same accuracy. Freed memory → larger batch sizes on the same GPU.

Full benchmark results: https://github.com/hcompai/late-interaction-kernels/blob/main/docs/benchmarks.md (8/N)

9h363

Tony Wu@tonywu_71

The grid is ~0.5 GB per 1k docs at ColPali scale (128 query tokens × 1024 page patches). In contrastive training (Nq = Nd = B) it grows as B², preventing larger batch sizes. (3/N)

9h722
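The figures in (3/N) can be checked with a few lines of arithmetic. The sketch below assumes fp32 grid entries (4 bytes each) and a single query scored against 1,000 pages; neither assumption is stated in the thread, and the helper `grid_bytes` is hypothetical.

```python
def grid_bytes(n_queries, n_docs, lq, ld, bytes_per_elem=4):
    """Bytes needed to hold the full [Nq, Nd, Lq, Ld] similarity grid."""
    return n_queries * n_docs * lq * ld * bytes_per_elem

LQ, LD = 128, 1024  # ColPali-scale token counts quoted in the thread

# One query against 1,000 pages, assuming fp32 entries.
per_1k_docs = grid_bytes(1, 1000, LQ, LD)
print(f"1 query x 1k docs: {per_1k_docs / 1e9:.2f} GB")

# Contrastive training: Nq = Nd = B, so doubling B quadruples the grid.
for B in (16, 32, 64):
    print(f"B={B}: {grid_bytes(B, B, LQ, LD) / 2**30:.2f} GiB")
```

Under these assumptions the first figure comes out at roughly half a gigabyte, in line with the thread, and the loop shows the quadratic growth in B.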

Tony Wu@tonywu_71

Mathematically identical to naive. No approximation and numerically equivalent to PyTorch (fp32 accumulators, parity tested). Backward is also fused. (5/N)

9h422
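The thread does not say how the fused backward in (5/N) is implemented. As background only, here is a NumPy sketch of the textbook gradient of MaxSim for one query-document pair: each query token's gradient is its winning doc token, and each doc token accumulates the query tokens it won. The function names are hypothetical, and a finite-difference check stands in for a parity test.

```python
import numpy as np

def maxsim_pair(q, d):
    """q: [Lq, dim], d: [Ld, dim] -> scalar score (sum of per-query-token maxima)."""
    return (q @ d.T).max(axis=1).sum()

def maxsim_pair_grad(q, d):
    """Gradients of maxsim_pair with respect to q and d (valid away from ties)."""
    best = (q @ d.T).argmax(axis=1)   # winning doc token index per query token
    grad_q = d[best]                  # d(score)/d(q_i) = d[best[i]]
    grad_d = np.zeros_like(d)
    np.add.at(grad_d, best, q)        # scatter-add each q_i onto its winning doc token
    return grad_q, grad_d

rng = np.random.default_rng(0)
q = rng.standard_normal((3, 4))
d = rng.standard_normal((5, 4))
grad_q, grad_d = maxsim_pair_grad(q, d)

# Central finite difference on one coordinate of q as a sanity check.
eps = 1e-6
step = np.zeros_like(q)
step[0, 0] = eps
numeric = (maxsim_pair(q + step, d) - maxsim_pair(q - step, d)) / (2 * eps)
```

Only the argmax indices are needed for the backward pass, which is one reason the full grid does not have to be kept around for training.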

Tony Wu@tonywu_71

⏩ Not writing the grid is faster too, at matched numerics:

• 1.7–16× on reranking / inference (longer Ld → higher end)
• 5.0–6.9× on MaxSim (fwd + bwd) in PyLate's cached-contrastive loss
• long-context (Ld ≥ 8k): shapes the naive path can't run

(7/N)

9h392

Tony Wu@tonywu_71

💾 In real ColQwen2 training with colpali-engine (H100, LoRA + grad-checkpointing, real colpali_train_set pages), the MaxSim op drops from 7.8 GiB to 61 MiB at batch 128. About 130×, same GPU. The fused kernel's 61 MiB fits in the memory scraps vanilla can't allocate. (6/N)

9h392
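The "about 130×" figure in (6/N) follows directly from the two memory numbers quoted there. A quick check, using only values from the thread:

```python
naive_mib = 7.8 * 1024   # 7.8 GiB expressed in MiB
fused_mib = 61           # fused kernel footprint in MiB
ratio = naive_mib / fused_mib
print(f"MaxSim memory reduction: about {ratio:.0f}x")
```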

Tony Wu@tonywu_71

👀 Full walkthrough — the tiling, the online max, the backward pass, step-through animations and benchmark plots:

https://hcompai.github.io/late-interaction-kernels/how-it-works.html (11/N)

9h302

Erik Kaunismäki@ErikKaum

@tonywu_71 @Aurelien_L_ Awesome work guys 🙌 And those visuals are just so nice 😍

9h261

Lunari@0x_lun

@tonywu_71 @Aurelien_L_ fusing MaxSim was always the obvious bottleneck but nobody had clean triton kernels for it until now

curious how the memory savings hold up at longer sequence lengths with dense token sets

9h24

Antoine Chaffin@antoine_chaffin

@tonywu_71 @Aurelien_L_ late interaction is getting commoditized also, if I rebase my branch, it means you can train ColPali models with LIK... ;)

8h16

Antoine E.@antoine_edy

@tonywu_71 @Aurelien_L_ The design walkthrough webpage is sooo good 🫶

8h5