End-to-End Context Compression at Scale
Encoder-decoder compressors map a long token sequence to a shorter sequence of latent embeddings, but have not been competitive with KV cache compression.
This work revisits encoder-decoder compression.
It performs an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors.
It also continually pre-trains a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16.
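
To make the compression ratios concrete, here is a minimal sketch of what a 1:4, 1:8, or 1:16 ratio means for sequence length. It is not the method described in this work: `latent_length` and `mean_pool_compress` are hypothetical helpers, the round-up behavior for contexts that do not divide evenly is an assumption, and block-wise mean pooling is only a toy stand-in for a learned encoder.

```python
import math


def latent_length(num_tokens: int, ratio: int) -> int:
    """Latent embeddings produced by a 1:ratio compressor.

    Assumption: a trailing partial block still yields one latent (round up).
    """
    if num_tokens < 0 or ratio < 1:
        raise ValueError("num_tokens must be >= 0 and ratio must be >= 1")
    return math.ceil(num_tokens / ratio)


def mean_pool_compress(embeddings, ratio):
    """Toy stand-in for an encoder: average each block of `ratio` vectors.

    `embeddings` is a list of equal-length lists of floats. A learned
    compressor would replace this averaging with a trained model.
    """
    latents = []
    for start in range(0, len(embeddings), ratio):
        block = embeddings[start:start + ratio]
        dim = len(block[0])
        latents.append(
            [sum(vec[d] for vec in block) / len(block) for d in range(dim)]
        )
    return latents


if __name__ == "__main__":
    # Illustrative context length only; not a figure from the source.
    for ratio in (4, 8, 16):
        print(f"1:{ratio} -> {latent_length(32768, ratio)} latents")
```

Under these assumptions the decoder attends over `latent_length(n, ratio)` embeddings instead of `n` tokens, which is where the speed and memory savings would come from.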
"We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage."