
Latent Context Language Models compress context tokens up to 16x, cutting time-to-first-token by 8.8x on the RULER benchmark

AI Judge changed title after evaluation, original title: "Columbia's Micah Goldblum and collaborators release Latent Context Language Models, achieving an 8.8x time-to-first-token speedup via context compression"

Story Overview

A new encoder-decoder setup called Latent Context Language Models (LCLMs) turns lengthy token sequences into compact latent embeddings that a decoder LLM consumes in place of the original tokens. This sidesteps the KV cache growth that makes long-context inference memory-hungry. The released family pairs a 0.6B encoder with a 4B decoder, is continually pre-trained on over 350B tokens each, and covers compression ratios of 1:4, 1:8, and 1:16.

Original post

End-to-End Context Compression at Scale

Encoder-decoder compressors map a long token sequence to a shorter sequence of latent embeddings, but they have previously not been competitive with KV cache compression.

This work revisits encoder-decoder compression.

The authors perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors.

They then continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16.

"We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage."

3:32 AM · Jun 9, 2026 · 8.9K Views
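
To make the ratios concrete, here is a minimal sketch of the sequence-length arithmetic. It assumes that a ratio of 1:r means roughly one latent embedding per r context tokens and that a trailing partial group still produces one latent. Both points are assumptions for illustration, and the function name `latent_length` is hypothetical rather than taken from the released code.

```python
import math


def latent_length(num_tokens: int, ratio: int) -> int:
    """Latent embeddings needed for `num_tokens` context tokens at compression 1:ratio.

    Assumption: one latent per `ratio` tokens, rounding up for a partial final group.
    """
    if num_tokens < 0 or ratio < 1:
        raise ValueError("num_tokens must be >= 0 and ratio must be >= 1")
    return math.ceil(num_tokens / ratio)


# The three ratios named in the post, applied to a 64k-token context.
for ratio in (4, 8, 16):
    print(f"1:{ratio} -> {latent_length(65536, ratio)} latents")
```

Under these assumptions, a 64k-token context shrinks to a few thousand decoder positions at 1:16, which is the quantity that drives decoder-side memory and prefill cost.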
Benchmark Wins

Inference gets faster without accuracy trade-offs

On RULER at 4k context and LongBench at 64k, the models deliver better accuracy-latency and accuracy-memory trade-offs than KV-cache compression baselines, including up to 8.8× faster time-to-first-token on RULER at higher compression ratios while matching or improving accuracy.

Open Release

Everything needed to experiment is already public

The team released the models on Hugging Face under the latent-context org and the code on GitHub, so researchers can reproduce the architecture search, continual pre-training, and long-context evaluations.

Sentiment

Commenters describe Latent Context Language Models as cool and impressive work, citing their effective compression of massive contexts and their contribution to open-source progress and AI democratization.

Positive: 100.0%
Negative: 0.0%
6 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 45K · Bookmarks 279 · Likes 392 · Retweets 59 · Replies 13
Micah Goldblum@micahgoldblum

We trained language models that compress massive contexts into tiny latent representations. Latent Context Language Models (LCLMs) outperform existing KV cache compression methods on the latency/accuracy frontier. 🧵1/10

1d · Views 45K · Likes 392 · Bookmarks 279
Yann LeCun@ylecun

@micahgoldblum Compression architecture == ConvNet-style hierarchy through pooling/stride.

1d · Views 11.3K · Likes 78 · Bookmarks 23
Micah Goldblum@micahgoldblum

Paper📝: https://arxiv.org/abs/2606.09659 Models🤖: https://huggingface.co/latent-context Code💻: https://github.com/LeonLixyz/LCLM 10/10

1d · Views 1.5K · Likes 40 · Bookmarks 21
Leon@iamleonli

How far can we compress the discrete tokens in an LLM's context into compact latent vectors?

With the right training recipe at large scale, our Latent Context Language Models (LCLMs) compress context up to 16× and land on a new Pareto frontier for long-context inference. 🧵(1/n)

1d · Views 4.5K · Likes 47 · Bookmarks 8
Micah Goldblum@micahgoldblum

With the right architecture + training recipe, learned compression works far better than prior work suggested. We trained 4×, 8×, and 16× compressors jointly with pretrained decoders on 350B+ tokens, and we tested out tons of architectures and staged training pipelines. 5/10

1d · Views 1K · Likes 18 · Bookmarks 5
Micah Goldblum@micahgoldblum

Instead, we use a small encoder to convert raw text into compact latent representations in small chunks. The larger decoder then only takes in those compressed latent representations. This technique does not require the full prefill, and it is hardware and software friendly. 4/10

1d · Views 1.1K · Likes 17 · Bookmarks 3
Micah Goldblum@micahgoldblum

Our models establish a new Pareto frontier on long-context benchmarks like RULER, LongBench, and LongHealth. At high compression ratios, we see much lower memory consumption, substantially faster TTFT, and strong long-context accuracy. 6/10

1d · Views 908 · Likes 19 · Bookmarks 2

models: https://huggingface.co/latent-context code: https://github.com/LeonLixyz/LCLM arxiv: https://arxiv.org/abs/2606.09659

1d · Views 1.7K · Likes 3 · Bookmarks 4
Micah Goldblum@micahgoldblum

We then built an agentic system where the model globally reasons over compressed context and selectively expands regions on demand. This lets an agent “skim” enormous corpora before zooming in on precise details. 7/10

1d · Views 818 · Likes 16 · Bookmarks 1
Micah Goldblum@micahgoldblum

Long-context inference is becoming a massive bottleneck for AI systems. Agents increasingly need to reason over huge codebases, long chats, reasoning traces, etc. Naively scaling context is prohibitively expensive. 2/10

1d · Views 1.7K · Likes 17 · Bookmarks 1
Micah Goldblum@micahgoldblum

Most existing approaches compress the KV cache after the model processes the entire context. That means you still pay the full prefill cost, latency remains high, and many of those methods are hard to deploy in real systems. 3/10

1d · Views 1.4K · Likes 17 · Bookmarks 1
Micah Goldblum@micahgoldblum

We believe that compression architectures can meet the growing context demands of agents and reasoning models. How to plug these models into complex agentic systems and how to compress text online while generating are open problems. 8/10

1d · Views 868 · Likes 15 · Bookmarks 1
Micah Goldblum@micahgoldblum

Huge thanks to all my amazing collaborators who made this possible: @iamleonli, @SeanMcleish, @tonychenxyz, @qw3rtman, @tingtang222, @artemg314, Suhas, and more. Ang (Leon) led the charge here and did an enormous amount of work on this project! 9/10

1d · Views 1.3K · Likes 12 · Bookmarks 0
Jatin Prakash@bicycleman15

@micahgoldblum Hey @micahgoldblum @iamleonli really cool work! and great execution!

We did explore a similar idea a few months ago and,

took it a step further to yield test-time control 🕹️ of inference costs in a single architecture :)))

19h · Views 7 · Likes 1 · Bookmarks 1
Leon@iamleonli

Grateful to @SeanMcleish @tonychenxyz @qw3rtman @tomgoldsteincs @LotfiSanae @micahgoldblum @Pavel_Izmailov and the whole team 🙏 Thanks to @TongPetersb @pfactorialz @EBorgnia for helpful discussions. Thanks @tonychenxyz for drafting the tweet! (11/n)

1d · Views 39
Leon@iamleonli

KV cache memory grows linearly and attention compute quadratically with context length, so longer context means explosive memory and slower decoding. We train LCLMs to fold context into a handful of latent vectors while keeping the underlying model's abilities intact. (2/n)

1d · Views 29
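
The scaling argument in this post can be checked with a back-of-the-envelope calculation. The sketch below uses the standard KV cache size formula (2 tensors × layers × KV heads × head dimension × sequence length × bytes per element) and counts causal attention query-key pairs as n(n+1)/2. The model dimensions are hypothetical placeholders, not the LCLM decoder's actual configuration, and the comparison covers the decoder side only (encoder cost is ignored).

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size for one sequence: keys and values at every layer (linear in seq_len)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def causal_attention_pairs(seq_len):
    """Query-key pairs scored during a causal prefill (quadratic in seq_len)."""
    return seq_len * (seq_len + 1) // 2


# Hypothetical decoder dimensions, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

full_len = 65536                 # raw context tokens
latent_len = full_len // 16      # decoder positions at 1:16, assuming one latent per 16 tokens

for name, n in (("raw tokens", full_len), ("1:16 latents", latent_len)):
    gib = kv_cache_bytes(n, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{name}: {gib:.2f} GiB KV cache, {causal_attention_pairs(n):,} attention pairs")
```

Because memory is linear in sequence length, a 16× shorter decoder input cuts the KV cache by 16×, while the quadratic attention term shrinks by roughly 256×.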
Leon@iamleonli

The setup: A 0.6B encoder compresses raw tokens into latents, an adapter maps them into the decoder's space, and a 4B decoder reads them as latent context in place of the original tokens. We train the pair end-to-end on 350B+ tokens, at 4×, 8×, and 16× compression ratios. (3/n)

1d · Views 22
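
The data flow described in this post (encoder, then adapter, then decoder input) can be sketched at the level of tensor shapes. This is a toy illustration, not the released implementation: mean pooling stands in for the trained 0.6B encoder, a random linear map stands in for the adapter, all sizes are tiny placeholders, and the function names `encode_chunks` and `adapt` are hypothetical.

```python
import random


def encode_chunks(token_embs, ratio):
    """Stand-in encoder: mean-pool every `ratio` token embeddings into one latent.

    Assumption for illustration only; the real encoder is a trained language model.
    """
    latents = []
    for start in range(0, len(token_embs), ratio):
        chunk = token_embs[start:start + ratio]
        dim = len(chunk[0])
        latents.append([sum(vec[i] for vec in chunk) / len(chunk) for i in range(dim)])
    return latents


def adapt(latents, weight):
    """Stand-in adapter: linear map from encoder width to decoder width.

    `weight` has shape [decoder_dim][encoder_dim].
    """
    return [[sum(w * x for w, x in zip(row, z)) for row in weight] for z in latents]


rng = random.Random(0)
ENC_DIM, DEC_DIM, NUM_TOKENS, RATIO = 8, 12, 64, 16  # toy sizes

token_embs = [[rng.gauss(0, 1) for _ in range(ENC_DIM)] for _ in range(NUM_TOKENS)]
weight = [[rng.gauss(0, 1) for _ in range(ENC_DIM)] for _ in range(DEC_DIM)]

# The decoder would read these vectors as context in place of the original tokens.
latent_context = adapt(encode_chunks(token_embs, RATIO), weight)
print(len(token_embs), "tokens ->", len(latent_context), "latents of width", len(latent_context[0]))
```

The point of the sketch is the shape change: the decoder sees `NUM_TOKENS / RATIO` positions in its own embedding width, so its prefill and KV cache scale with the latent count rather than the raw token count.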
Arcadiy Ivanov@IAmArcIvanov

@micahgoldblum @threadreaderapp unroll

21h · Views 19
Leon@iamleonli

We are not the first to compress context. Prior work mostly evicts KV cache entries using hand-designed heuristics. Against SnapKV, KVzip, Expected Attention, and Attention Matching, LCLMs match or beat accuracy with up to 8.8× faster TTFT and far less memory. (4/n)

1d · Views 18