/Tech · 1h ago

New LCLMs Encode 16 Tokens as One Latent Token for Faster LLMs

Original post

Pavel Izmailov @Pavel_Izmailov

New paper: Latent Context Language Models (LCLMs)!

Idea: encode 16 tokens as 1 latent token, and have the LLM work on top of the latent tokens. Result: general-purpose model with much better performance / speed / memory usage frontier.

10:13 AM · Jun 10, 2026 · 2.1K Views
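
To make the 16-to-1 idea concrete, here is a rough back-of-the-envelope sketch. The 128,000-token context and the rounding-up of a partial last chunk are assumptions for illustration, not details from the post; only the ratio of 16 comes from it.

```python
import math

TOKENS_PER_LATENT = 16  # compression ratio stated in the post


def latent_length(num_tokens: int, ratio: int = TOKENS_PER_LATENT) -> int:
    """Latent positions needed for num_tokens.

    Assumption: a partial final chunk still occupies one latent slot.
    """
    return math.ceil(num_tokens / ratio)


context_tokens = 128_000  # hypothetical long context
latents = latent_length(context_tokens)

# Full self-attention over a sequence scales roughly with length squared,
# so shrinking the sequence 16x shrinks that term by about 16 * 16.
attention_saving = (context_tokens / latents) ** 2

print(latents)           # 8000
print(attention_saving)  # 256.0
```

This only counts positions seen by the decoder; it ignores whatever the encoder itself costs.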

Sentiment

Users highlight that LCLMs could let agents use a tool to selectively uncompress key parts of the context, and call this a cool capability for faster LLMs.

Pos: 100.0% · Neg: 0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 507 · Likes: 5 · Retweets: 3

Sean McLeish @SeanMcleish

Humans don’t maintain exact, line-by-line recall of huge contexts like full codebases or long legal documents. We keep a high-level mental model, then look things up when precision matters. We enable LLMs to do this, with high speed.

Bookmarks: 2 · Replies: 1

Pavel Izmailov @Pavel_Izmailov

Paper: https://arxiv.org/abs/2606.09659
Models: https://huggingface.co/latent-context
Code: https://github.com/LeonLixyz/LCLM

Led by @iamleonli with amazing collaborators: @SeanMcleish @tonychenxyz @qw3rtman @tingtang222 @artemg314 @tomgoldsteincs @LotfiSanae @micahgoldblum and more!

Pavel Izmailov @Pavel_Izmailov

One other cool thing is that we can make an agent with a tool to uncompress important parts of the context, if it needs to look at it again in more detail. That gives even better performance!
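
One way to picture the uncompress tool is a context that keeps the raw chunks around and lets the agent swap selected latents back to full tokens. The sketch below is purely illustrative: `CompressedContext`, `expand`, and `decoder_length` are hypothetical names, and the thread does not say how the actual tool is exposed.

```python
class CompressedContext:
    """Toy bookkeeping for a context stored as 16-token chunks (hypothetical)."""

    def __init__(self, tokens, chunk_size=16):
        self.chunks = [
            tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)
        ]
        self.expanded = set()  # chunk indices the agent asked to see raw

    def expand(self, chunk_ids):
        """Stand-in for the agent's tool call: show these chunks uncompressed."""
        for i in chunk_ids:
            if not 0 <= i < len(self.chunks):
                raise IndexError(f"no chunk {i}")
            self.expanded.add(i)

    def decoder_length(self):
        """Positions the decoder sees: 1 per latent, len(chunk) per raw chunk."""
        return sum(
            len(chunk) if i in self.expanded else 1
            for i, chunk in enumerate(self.chunks)
        )


ctx = CompressedContext(list(range(160)))  # 160 tokens -> 10 chunks
print(ctx.decoder_length())  # 10
ctx.expand([3])              # agent wants chunk 3 in full detail
print(ctx.decoder_length())  # 25
```

The point of the sketch is the trade-off: expanding one chunk costs 15 extra positions, so the agent pays for detail only where it asks for it.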

Pavel Izmailov @Pavel_Izmailov

We experimented with lots of architectures, and the final one looks like this. An encoder transformer encodes chunks of tokens, followed by pooling and an MLP adapter; the output goes into the standard LLM decoder.

Importantly, we can compress arbitrary pieces of the context and mix in normal uncompressed tokens.
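
As a rough illustration of mixing compressed and uncompressed pieces, the sketch below builds the sequence a decoder would see. It is a layout sketch only: `build_decoder_input` is a hypothetical helper, and each `("latent", chunk)` entry stands in for the vector that the encoder, pooling, and MLP adapter would produce for that chunk.

```python
CHUNK = 16  # tokens per latent, as stated in the thread


def chunk(tokens, size=CHUNK):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def build_decoder_input(segments):
    """segments: list of (tokens, compress) pairs, in context order.

    Returns a list of ("latent", chunk) and ("token", token_id) entries.
    In a real model each latent entry would be one vector from the
    encoder -> pooling -> MLP adapter path (not implemented here).
    """
    sequence = []
    for tokens, compress in segments:
        if compress:
            sequence.extend(("latent", tuple(c)) for c in chunk(tokens))
        else:
            sequence.extend(("token", t) for t in tokens)
    return sequence


document = list(range(40))             # compressed: chunks of 16, 16, 8
question = [100, 101, 102, 103, 104]   # kept as normal tokens
seq = build_decoder_input([(document, True), (question, False)])

print(len(seq))                    # 8
print([kind for kind, _ in seq])   # 3 latents followed by 5 tokens
```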

Pavel Izmailov @Pavel_Izmailov

We train the whole model in a staged pipeline with a next-token-prediction objective, on a mix of context-reconstruction and generic next-token-prediction tasks.

Pavel Izmailov @Pavel_Izmailov

We outperform the baselines (KV cache compression) on both time-to-first-token and peak GPU memory. Effectively, our method replaces the original context with a much smaller context, with very little extra computation.
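
For a feel of why a smaller context helps peak memory, here is the standard KV cache size estimate applied to a full context versus one 16 times shorter. The model dimensions and the 128,000-token context are hypothetical, and this compares against an uncompressed context rather than the KV cache compression baselines in the post; it also ignores the encoder's own memory.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for one sequence.

    The leading 2 counts one key tensor plus one value tensor per layer;
    bytes_per_value=2 assumes 16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical decoder configuration, chosen only for illustration.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

raw_context = 128_000
latent_context = raw_context // 16  # 16 tokens per latent, per the thread

raw_bytes = kv_cache_bytes(raw_context, **cfg)
latent_bytes = kv_cache_bytes(latent_context, **cfg)

print(f"raw:    {raw_bytes / 2**30:.2f} GiB")
print(f"latent: {latent_bytes / 2**30:.2f} GiB")
print(raw_bytes // latent_bytes)  # 16
```

Because the cache grows linearly with sequence length, the saving on this term matches the compression ratio.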

Pavel Izmailov @Pavel_Izmailov

And see also threads by Micah and Leon!

Leon @iamleonli

How far can we compress the discrete tokens in an LLM's context into compact latent vectors?

With the right training recipe at large scale, our Latent Context Language Models (LCLMs) compress context up to 16× and land on a new Pareto frontier for long-context inference. 🧵(1/n)
