
New LCLMs Encode 16 Tokens as One Latent Token for Faster LLMs

Original post
Pavel Izmailov@Pavel_Izmailov

New paper: Latent Context Language Models (LCLMs)!

Idea: encode 16 tokens as 1 latent token, and have the LLM work on top of the latent tokens. Result: general-purpose model with much better performance / speed / memory usage frontier.

10:13 AM · Jun 10, 2026 · 2.1K Views
Sentiment

Commenters highlight that the LCLM method could let agents use a tool to selectively uncompress key parts of the context, which they see as a promising capability for faster LLMs.

Sean McLeish@SeanMcleish

Humans don’t maintain exact, line-by-line recall of huge contexts like full codebases or long legal documents. We keep a high-level mental model, then look things up when precision matters. We enable LLMs to do this, with high speed.

Pavel Izmailov@Pavel_Izmailov

Paper: https://arxiv.org/abs/2606.09659 Models: https://huggingface.co/latent-context Code: https://github.com/LeonLixyz/LCLM

Led by @iamleonli with amazing collaborators: @SeanMcleish @tonychenxyz @qw3rtman @tingtang222 @artemg314 @tomgoldsteincs @LotfiSanae @micahgoldblum and more!

Pavel Izmailov@Pavel_Izmailov

One other cool thing is that we can make an agent with a tool to uncompress important parts of the context, if it needs to look at it again in more detail. That gives even better performance!
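
To make the "uncompress on demand" idea concrete, here is a minimal sketch of how such a tool could be exposed to an agent. Everything in it is an assumption for illustration: the class name `CompressedContext`, the `expand` tool, and the `<latent:i>` placeholders are hypothetical and are not taken from the paper or its code. Real latents would be vectors produced by the encoder, not strings.

```python
from typing import List

CHUNK = 16  # tokens per latent, following the thread's "16 tokens as 1 latent token"


class CompressedContext:
    """Toy context store: each chunk is shown as one latent placeholder
    until the agent asks to see its raw tokens again (hypothetical design)."""

    def __init__(self, tokens: List[str]):
        # Keep the original tokens around so any chunk can be restored later.
        self.chunks = [tokens[i:i + CHUNK] for i in range(0, len(tokens), CHUNK)]
        self.expanded = set()

    def expand(self, chunk_id: int) -> List[str]:
        """Tool call: mark one chunk as uncompressed and return its raw tokens."""
        if not 0 <= chunk_id < len(self.chunks):
            raise IndexError(f"no chunk {chunk_id}")
        self.expanded.add(chunk_id)
        return self.chunks[chunk_id]

    def view(self) -> List[str]:
        """Sequence the decoder would see: placeholders plus expanded chunks."""
        out: List[str] = []
        for i, chunk in enumerate(self.chunks):
            if i in self.expanded:
                out.extend(chunk)
            else:
                out.append(f"<latent:{i}>")
        return out


ctx = CompressedContext([f"t{i}" for i in range(48)])
before = ctx.view()   # three placeholders
ctx.expand(1)         # agent decides chunk 1 needs a closer look
after = ctx.view()    # one placeholder, 16 raw tokens, one placeholder
```

The point of the sketch is only the control flow: the agent pays the full token cost for the chunks it explicitly asks for, and keeps the compressed view everywhere else.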

Pavel Izmailov@Pavel_Izmailov

We experiment with lots of architectures, and the final one looks like this. An encoder transformer encodes chunks of tokens, followed by pooling and an MLP adapter; the output goes into the standard LLM decoder.

Importantly, we can compress arbitrary pieces of the context and mix in normal uncompressed tokens.
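
A rough sketch of what "mixing" compressed and uncompressed pieces could look like at the sequence level. It only builds the layout of the decoder input; the `Latent` placeholder stands in for the vector that the encoder, pooling, and MLP adapter would produce. The function name `build_decoder_input`, the span format, and the handling of a short final chunk are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

CHUNK = 16  # tokens per latent, following the thread's "16 tokens as 1 latent token"


@dataclass(frozen=True)
class Latent:
    """Placeholder for one latent vector covering source tokens [start, end)."""
    start: int
    end: int


Slot = Union[int, Latent]  # a raw token id or a latent placeholder


def build_decoder_input(tokens: List[int],
                        compress_spans: List[Tuple[int, int]]) -> List[Slot]:
    """Replace each [start, end) span with one Latent per CHUNK tokens;
    tokens outside the spans are passed through unchanged."""
    out: List[Slot] = []
    pos = 0
    for start, end in sorted(compress_spans):
        if start < pos or end > len(tokens) or start >= end:
            raise ValueError(f"bad or overlapping span: ({start}, {end})")
        out.extend(tokens[pos:start])          # raw tokens before the span
        for c in range(start, end, CHUNK):     # one latent per chunk
            out.append(Latent(c, min(c + CHUNK, end)))
        pos = end
    out.extend(tokens[pos:])                   # raw tokens after the last span
    return out


# 100 tokens, compress positions 10..73 (64 tokens -> 4 latents).
seq = build_decoder_input(list(range(100)), [(10, 74)])
```

Here the decoder would see 40 positions instead of 100: 10 raw tokens, 4 latents, then 26 raw tokens.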

Pavel Izmailov@Pavel_Izmailov

We train the whole model in a staged pipeline for next-token prediction on a mix of context reconstruction and generic next-token prediction.


Pavel Izmailov@Pavel_Izmailov

We outperform the baselines (KV cache compression) on both time-to-first-token and peak GPU memory. Effectively, our method replaces the original context with a much smaller context, with very little extra computation.
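
A back-of-the-envelope sketch of why a shorter context helps memory. It uses the standard KV cache size estimate (keys and values, per layer, per KV head, per position) and compares a raw context with one compressed 16×. The model dimensions below are made up for illustration and are not the paper's; the estimate also ignores the encoder's own cost and activations, so it does not reproduce the reported measurements.

```python
import math


def decoder_positions(n_tokens: int, ratio: int = 16) -> int:
    """Positions the decoder sees if every `ratio` tokens become one latent."""
    return math.ceil(n_tokens / ratio)


def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Standard estimate: 2 (K and V) * layers * kv_heads * head_dim * positions."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical decoder: 32 layers, 8 KV heads, head_dim 128, 16-bit cache.
raw_len = 128_000
latent_len = decoder_positions(raw_len)

raw_bytes = kv_cache_bytes(raw_len, 32, 8, 128)
latent_bytes = kv_cache_bytes(latent_len, 32, 8, 128)

print(f"raw:    {raw_len} positions, {raw_bytes / 1e9:.2f} GB KV cache")
print(f"latent: {latent_len} positions, {latent_bytes / 1e9:.2f} GB KV cache")
```

Because the cache grows linearly with the number of positions, a fully compressed context shrinks it by the compression ratio under these assumptions; prefill attention cost, which grows faster than linearly in sequence length, benefits as well.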

Pavel Izmailov@Pavel_Izmailov

And see also threads by Micah and Leon!

Leon@iamleonli

How far can we compress the discrete tokens in an LLM's context into compact latent vectors?

With the right training recipe at large scale, our Latent Context Language Models (LCLMs) compress context up to 16× and land on a new Pareto frontier for long-context inference. 🧵(1/n)
