New paper: Latent Context Language Models (LCLMs)!
Idea: encode 16 tokens as 1 latent token, and have the LLM work on top of the latent tokens. Result: general-purpose model with much better performance / speed / memory usage frontier.
Users highlight one capability of the LCLM method as especially cool for faster LLMs: agents with a tool for selectively uncompressing key parts of the context.
Humans don't maintain exact, line-by-line recall of huge contexts like full codebases or long legal documents. We keep a high-level mental model, then look things up when precision matters. We enable LLMs to do this, with high speed.
Paper: https://arxiv.org/abs/2606.09659 Models: https://huggingface.co/latent-context Code: https://github.com/LeonLixyz/LCLM
Led by @iamleonli with amazing collaborators: @SeanMcleish @tonychenxyz @qw3rtman @tingtang222 @artemg314 @tomgoldsteincs @LotfiSanae @micahgoldblum and more!
One other cool thing is that we can make an agent with a tool to uncompress important parts of the context, if it needs to look at it again in more detail. That gives even better performance!
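A minimal sketch of how such an uncompress tool could be tracked on the agent side. Everything here (the `Chunk` and `CompressedContext` names, the `expand` method, keeping the raw tokens next to each latent) is a hypothetical illustration, not the paper's or the repo's API; only the 16-to-1 ratio comes from the thread.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    tokens: list            # original tokens, kept so the chunk can be restored
    expanded: bool = False  # False: decoder sees 1 latent; True: decoder sees raw tokens


@dataclass
class CompressedContext:
    chunks: list = field(default_factory=list)

    @classmethod
    def from_tokens(cls, tokens, chunk_size=16):
        # Split the context into fixed-size chunks (16 tokens per latent, as in the thread).
        return cls([Chunk(tokens[i:i + chunk_size])
                    for i in range(0, len(tokens), chunk_size)])

    def expand(self, index):
        # Hypothetical tool call: show chunk `index` to the decoder uncompressed.
        self.chunks[index].expanded = True

    def decoder_length(self):
        # Each compressed chunk costs 1 position; an expanded chunk costs its token count.
        return sum(len(c.tokens) if c.expanded else 1 for c in self.chunks)


ctx = CompressedContext.from_tokens(list(range(160)))
print(ctx.decoder_length())  # 10 latents for 160 tokens
ctx.expand(3)                # agent asks to re-read chunk 3 in detail
print(ctx.decoder_length())  # 25 = 9 latents + 16 raw tokens
```

The point of the sketch is the cost model: expanding one chunk adds only 15 positions, so the agent can afford to re-read a few chunks exactly while the rest stays compressed.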
How far can we compress the discrete tokens in an LLM's context into compact latent vectors?
With the right training recipe at large scale, our Latent Context Language Models (LCLMs) compress context up to 16× and land on a new Pareto frontier for long-context inference. 🧵 (1/n)
We experiment with lots of architectures, and the final one looks like this. Encoder transformer encodes chunks of tokens, followed by pooling and an MLP adapter; the output goes into the standard LLM decoder.
Importantly, we can compress arbitrary pieces of the context and mix in normal uncompressed tokens.
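A toy sketch of the data flow described above: chunks of token embeddings are pooled into one latent each, and compressed and uncompressed segments are concatenated into a single decoder input. This is not the released model code. Mean pooling is an assumption (the thread only says "pooling"), the learned encoder and MLP adapter are omitted, and the function names are made up for illustration.

```python
CHUNK = 16  # tokens per latent, from the thread's "16 tokens as 1 latent token"


def mean_pool(vectors):
    # Average a list of equal-length vectors into one vector (assumed pooling type).
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def compress(token_embeddings, chunk=CHUNK):
    # Stand-in for encoder -> pooling -> MLP adapter: one latent per chunk.
    return [mean_pool(token_embeddings[i:i + chunk])
            for i in range(0, len(token_embeddings), chunk)]


def build_decoder_input(segments, chunk=CHUNK):
    # segments: list of (embeddings, should_compress) pairs, in context order.
    out = []
    for embeddings, should_compress in segments:
        out.extend(compress(embeddings, chunk) if should_compress else embeddings)
    return out


document = [[float(i), 1.0] for i in range(64)]  # 64 token embeddings, dim 2
question = [[0.0, 0.0] for _ in range(5)]        # 5 tokens left uncompressed

sequence = build_decoder_input([(document, True), (question, False)])
print(len(sequence))  # 9 = 4 latents + 5 raw tokens
```

Here the long document is compressed while the short question stays as ordinary tokens, so the decoder sees 9 positions instead of 69.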
We outperform the baselines (KV cache compression) on both time-to-first-token and peak GPU memory. Effectively, our method replaces the original context with a much smaller context, with very little extra computation.
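A back-of-the-envelope sketch of why a shorter context helps memory: in a standard transformer decoder, KV cache size grows linearly with the number of positions. The model configuration below is hypothetical and not taken from the paper, and the estimate ignores the encoder's own cost and any segments left uncompressed.

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_value=2):
    # Keys and values (factor 2) for every layer, KV head, and position.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical decoder config, chosen only for illustration.
config = dict(layers=32, kv_heads=8, head_dim=128)

full = kv_cache_bytes(seq_len=128_000, **config)
compressed = kv_cache_bytes(seq_len=128_000 // 16, **config)

print(f"full context:  {full / 1e9:.2f} GB")
print(f"16x compressed: {compressed / 1e9:.2f} GB")
print(full / compressed)  # 16.0
```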
And see also threads by Micah and Leon!
We train the whole model in a staged pipeline with a next-token prediction objective, on a mix of context reconstruction and generic next-token prediction.