Releasing my first kernel on @huggingface: MaxSim
Late-interaction retrieval (ColBERT / PyLate) bottlenecks on materializing the full query-document token similarity matrix. This kernel avoids that with tiled scoring built on simdgroup_matrix (Metal) and WMMA (CUDA).
The result: a 3–5× speedup over naive PyTorch.
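
For anyone new to MaxSim: each query token takes its best dot product over all document tokens, and those maxima are summed. Here is a rough plain-Python sketch of the tiling idea, not the kernel itself. The function names, the `tile` parameter, and the toy vectors are all made up for illustration; the real kernel does this on the GPU with matrix instructions.

```python
def dot(u, v):
    # Dot product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))


def maxsim_naive(query, doc):
    # Builds the full len(query) x len(doc) similarity matrix,
    # then reduces it: max over doc tokens, sum over query tokens.
    sim = [[dot(q, d) for d in doc] for q in query]
    return sum(max(row) for row in sim)


def maxsim_tiled(query, doc, tile=2):
    # Hypothetical tiled variant: walk the doc tokens in blocks and keep
    # only a running max per query token, so the full matrix never exists.
    best = [float("-inf")] * len(query)
    for start in range(0, len(doc), tile):
        block = doc[start:start + tile]
        for i, q in enumerate(query):
            tile_max = max(dot(q, d) for d in block)
            if tile_max > best[i]:
                best[i] = tile_max
    return sum(best)


# Toy embeddings (illustrative only): 2 query tokens, 3 doc tokens.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

naive_score = maxsim_naive(query, doc)
tiled_score = maxsim_tiled(query, doc, tile=2)
```

Both paths give the same score; the tiled one just never holds more than one block of similarities at a time, which is the part that matters for memory traffic.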
Try it out 👇