Whether you are GPU poor or GPU rich, today's release of PyLate has something for you!
GPU maxxers: MaxSim kernels greatly speed up training while lowering the memory requirements
CPU enjoyers: TACHIOM enables lightning fast multi-vector indexing and search directly on CPU

MaxSim fused kernels: At the core of MaxSim is a very large |Q| * |D| matrix multiplication. This is very similar to the original quadratic cost of attention, which was lifted by the fused kernels in FlashAttention. So... let's do the same, but for MaxSim?
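
For readers who want to see the |Q| * |D| cost concretely, here is a minimal pure-Python sketch of MaxSim. It is only an illustration of the idea, not the code of any kernel in this release: `maxsim_naive` builds the full similarity matrix, while `maxsim_streaming` (a hypothetical name, as is `block_size`) walks the document in blocks and keeps one running max per query token, so the full matrix is never stored. Real fused kernels do this on GPU with far more care.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two token embeddings."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_naive(query: List[Vector], doc: List[Vector]) -> float:
    """MaxSim that materializes the full |Q| x |D| similarity matrix."""
    sim = [[dot(q, d) for d in doc] for q in query]
    # For each query token keep its best document token, then sum.
    return sum(max(row) for row in sim)


def maxsim_streaming(query: List[Vector], doc: List[Vector], block_size: int = 2) -> float:
    """Same score, but only one running max per query token is kept in memory."""
    running_max = [float("-inf")] * len(query)
    for start in range(0, len(doc), block_size):
        block = doc[start:start + block_size]
        for i, q in enumerate(query):
            for d in block:
                s = dot(q, d)
                if s > running_max[i]:
                    running_max[i] = s
    return sum(running_max)


# Toy example: 2 query tokens, 3 document tokens, 2 dimensions.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

score_naive = maxsim_naive(query, doc)
score_streaming = maxsim_streaming(query, doc)
print(score_naive, score_streaming)
```

Both functions return the same score; the difference is that the streaming version needs memory proportional to |Q| instead of |Q| * |D|, which is the same trade that FlashAttention makes for attention.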

@PonyRoi @tonywu_71 @Aurelien_L_ Usual cc to my fellow late interaction enjoyer gang @helloiamleonie @17Ahmetyucel @doesdatmaksense @MehdiAllahyari @jobergum @vishal_learner @trillarnie @CShorten30 @tonywu_71 (cheater) @ManuelFaysse @din0s_ @Robro612

I am very happy about this release, because it shows how a good ecosystem directly benefits from ongoing research, and ultimately so do the users. A big thanks to @PonyRoi, @tonywu_71 and @Aurelien_L_ for creating such cool kernels and engaging in discussions around design and performance, it was really cool to see! I can't wait to launch big trainings with those beauties.
Also big thanks to @SilvioMartinico for bearing with me through all my refactors and nits, and for making changes to TACHIOM's internals to ease the merge. I really appreciate it, and I believe TACHIOM will be very useful for a lot of resource-limited users (and will also open new avenues of research).
Finally, as per usual, big thanks to my co-maintainer @raphaelsrty for helping through the whole process (and essentially handling the kernels discussions after I just said "yeah, we should find a nice way to merge" 😇)
Blazing fast training kernels and blazing fast (sometimes beating GPU in my exp.) CPU indices?
Never been a better time to be a late interactor.

As explained in @SilvioMartinico's thread, TACHIOM "explicitly allocates centroids based on token frequency and semantic variance, partitioning the workload".
This makes it possible to cluster 600M vectors into 262K centroids in just 8 minutes on a CPU, with 10 ms single-CPU search on MS MARCO (8M documents).

We benched both implementations (as well as @ErikKaum's HF kernel, PR soon? 😇) and you can find a lot of the discussion here: https://github.com/lightonai/pylate/issues/224
Ultimately, we decided to merge the different kernels and make them interchangeable. You can leverage them by simply installing the corresponding package: https://lightonai.github.io/pylate/documentation/backends/#scoring-backends

@PonyRoi was the first one to open a pull request to PyLate to merge FlashMaxSim. It was followed later by @tonywu_71's and @Aurelien_L_'s PR to merge LIK: https://x.com/tonywu_71/status/2064701365318767100?s=20 (very nice explanations/visualizations!)

They are all very strong solutions that speed up training workloads while reducing memory pressure. Give them a shot and do not hesitate to give feedback, I am sure these kernels will grow and become even better in the future!

The TACHIOM index is very easy to use in PyLate, as it shares the exact same API as all of the other indexes. There's just one twist: you also have to get (and send) the token ids corresponding to the embeddings.
Otherwise it's as simple as usual: insert and search! https://lightonai.github.io/pylate/documentation/retrieval/#tachiom-retrieval

TACHIOM: The most used late interaction indexes rely on k-means to compute the centroids used for ANN. The issue is that, at scale, this becomes very costly to run on CPU... Enter TACHIOM, which cleverly exploits the **token ids** to speed up the process!
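
To give a feel for why token ids help, here is a toy pure-Python sketch. It is not TACHIOM's actual algorithm, only an illustration under my own assumptions: every token embedding already comes with a token id, so the embeddings can be split into groups for free, and each group can then be summarized on its own instead of running one huge k-means over everything. The helper names (`partition_by_token_id`, `group_stats`) are hypothetical. The per-group count and variance stand in for the "token frequency" and "semantic variance" signals; how TACHIOM actually turns them into a centroid budget is not shown here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = List[float]


def partition_by_token_id(token_ids: List[int], embeddings: List[Vector]) -> Dict[int, List[Vector]]:
    """Group token embeddings by the id of the token that produced them."""
    groups: Dict[int, List[Vector]] = defaultdict(list)
    for tid, emb in zip(token_ids, embeddings):
        groups[tid].append(emb)
    return dict(groups)


def group_stats(vectors: List[Vector]) -> Tuple[Vector, float]:
    """Return the mean vector of a group and its mean squared distance to that mean."""
    n = len(vectors)
    dim = len(vectors[0])
    centroid = [sum(v[j] for v in vectors) / n for j in range(dim)]
    variance = sum(
        sum((v[j] - centroid[j]) ** 2 for j in range(dim)) for v in vectors
    ) / n
    return centroid, variance


# Toy corpus: 5 token embeddings coming from 2 distinct token ids.
token_ids = [101, 7, 7, 101, 7]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.0, 3.0], [1.0, 0.0], [0.0, 2.0]]

groups = partition_by_token_id(token_ids, embeddings)
stats = {}
for tid, vecs in groups.items():
    centroid, variance = group_stats(vecs)
    stats[tid] = {"count": len(vecs), "centroid": centroid, "variance": variance}
print(stats)
```

In this toy data, token 101 always gets the same embedding (zero variance), while token 7 is both more frequent and more spread out, so it is the kind of token that would plausibly deserve more than one centroid.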

@Robro612 @antoine_chaffin That is amazing to hear! I actually haven't even tested it against GPU indexes myself yet, so that is a huge surprise win 😅 Excited for how late-interaction is evolving!

@antoine_chaffin back to back banger releases (yesterday Tony, today you)!

@SilvioMartinico @Robro612 Now tell the world we need a version of LateOn that works well with TACHIOM. And tell them it's coming on Tuesday in the meantime 😇

@doesdatmaksense Well Tony's is essentially part of this one, he just could not wait to brag, couldn't he? @tonywu_71 But I'll double down soon 😇

@antoine_chaffin @PonyRoi @tonywu_71 @Aurelien_L_ Thanks to you and the whole team! Loved collaborating on this, and I'm incredibly excited for the future of late-interaction! 🚀

@antoine_chaffin Good point. If you work with AI agents, local persistent memory is key. Mnemosyne implements hybrid retrieval (semantic + temporal + entities) on top of sqlite-vec. No cloud dependencies. Everything is open source on GitHub. @mnemosyne_oss