We're open-sourcing the Unigram tokenizer we rebuilt to reduce CPU utilization by 5-6x.
Small rerankers and embedders run in single-digit milliseconds on GPU, making CPU tokenization a meaningful share of total latency.
http://github.com/perplexityai/pplx-garden














