Micah Goldblum and collaborators open-source LCLMs, using latent context compression to deliver 8.8x faster long-context inference · Digg

Each model was trained on more than 350 billion tokens.

Original post

Tanishq Mathew Abraham, Ph.D. (@iScienceLuvr)

End-to-End Context Compression at Scale

Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings, have not been competitive with KV cache compression.

This work revisits encoder-decoder compression.

Perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors.

Continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16.
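
As a rough illustration of what these ratios mean for sequence length, the sketch below computes how many latent embeddings the decoder would see for a given context size. It assumes that a ratio of 1:r means r input tokens per latent and that a partial final group is rounded up; the post confirms neither detail, and `latent_length` is a hypothetical helper, not part of the released code.

```python
def latent_length(num_tokens: int, ratio: int) -> int:
    """Latents produced for num_tokens at compression 1:ratio.

    Assumption: a trailing partial group still costs one latent (round up).
    """
    if num_tokens < 0 or ratio <= 0:
        raise ValueError("num_tokens must be >= 0 and ratio must be > 0")
    return (num_tokens + ratio - 1) // ratio  # integer ceiling division


# The three ratios named in the post, applied to an example 128,000-token context.
for ratio in (4, 8, 16):
    print(f"1:{ratio} -> {latent_length(128_000, ratio)} latents")
```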

"We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage."

3:32 AM · Jun 9, 2026 · 5.5K Views

Posts from X

Micah Goldblum (@micahgoldblum)

We trained language models that compress massive contexts into tiny latent representations. Latent Context Language Models (LCLMs) outperform existing KV cache compression methods on the latency/accuracy frontier. 🧵1/10

Micah Goldblum (@micahgoldblum)

Long-context inference is becoming a massive bottleneck for AI systems. Agents increasingly need to reason over huge codebases, long chats, reasoning traces, etc. Naively scaling context is prohibitively expensive. 2/10

Leon (@iamleonli)

How far can we compress the discrete tokens in an LLM's context into compact latent vectors?

With the right training recipe at large scale, our Latent Context Language Models (LCLMs) compress context up to 16× and land on a new Pareto frontier for long-context inference. 🧵(1/n)

Micah Goldblum (@micahgoldblum)

Paper📝: https://arxiv.org/abs/2606.09659 Models🤖: https://huggingface.co/latent-context Code💻: https://github.com/LeonLixyz/LCLM 10/10

Micah Goldblum (@micahgoldblum)

Huge thanks to all my amazing collaborators who made this possible: @iamleonli, @SeanMcleish, @tonychenxyz, @qw3rtman, @tingtang222, @artemg314, Suhas, and more. Ang (Leon) led the charge here and did an enormous amount of work on this project! 9/10

Tanishq Mathew Abraham, Ph.D. (@iScienceLuvr)

models: https://huggingface.co/latent-context code: https://github.com/LeonLixyz/LCLM arxiv: https://arxiv.org/abs/2606.09659

Micah Goldblum (@micahgoldblum)

With the right architecture + training recipe, learned compression works far better than prior work suggested. We trained 4×, 8×, and 16× compressors jointly with pretrained decoders on 350B+ tokens, and we tested out tons of architectures and staged training pipelines. 5/10

Micah Goldblum (@micahgoldblum)

Instead, we use a small encoder to convert raw text into compact latent representations in small chunks. The larger decoder then only takes in those compressed latent representations. This technique does not require the full prefill, and it is hardware and software friendly. 4/10
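
The data flow described here can be sketched in a few lines. This is a toy stand-in, not the authors' encoder: mean pooling replaces the learned 0.6B encoder, and the names `compress_chunk`, `compress_context`, `chunk_size`, and `ratio` are hypothetical. The only point it makes is that each chunk is compressed independently, so the decoder receives a shorter sequence without the full context ever being processed in one pass.

```python
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors.

    Placeholder for a learned encoder; the real model is a neural network.
    """
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def compress_chunk(chunk: List[Vector], ratio: int) -> List[Vector]:
    """Map one chunk of token embeddings to roughly len(chunk) / ratio latents."""
    return [mean_pool(chunk[i:i + ratio]) for i in range(0, len(chunk), ratio)]


def compress_context(embeddings: List[Vector], chunk_size: int, ratio: int) -> List[Vector]:
    """Compress the context chunk by chunk; no chunk depends on any other."""
    latents: List[Vector] = []
    for start in range(0, len(embeddings), chunk_size):
        latents.extend(compress_chunk(embeddings[start:start + chunk_size], ratio))
    return latents


# Toy input: 64 "token embeddings" of dimension 2.
tokens = [[float(i), 1.0] for i in range(64)]
latents = compress_context(tokens, chunk_size=16, ratio=8)
print(len(tokens), "->", len(latents))  # 64 -> 8
```

Because the chunks are independent in this sketch, they could be encoded in parallel or as text arrives; whether the released models batch chunks this way is not stated in the thread.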

Micah Goldblum (@micahgoldblum)

Our models establish a new Pareto frontier on long-context benchmarks like RULER, LongBench, and LongHealth. At high compression ratios, we see much lower memory consumption, substantially faster TTFT, and strong long-context accuracy. 6/10
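
For readers unfamiliar with the term, a method sits on the Pareto frontier if no other method is at least as good on every axis and strictly better on one. The sketch below computes the frontier for two axes, latency (lower is better) and accuracy (higher is better). The method names and numbers are placeholders chosen for illustration only; they are not results from the paper or the thread.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (name, latency, accuracy)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates, sorted by latency."""
    frontier = []
    for name, lat, acc in points:
        dominated = any(
            (l2 <= lat and a2 >= acc) and (l2 < lat or a2 > acc)
            for _, l2, a2 in points
        )
        if not dominated:
            frontier.append((name, lat, acc))
    return sorted(frontier, key=lambda p: p[1])


# Placeholder values for illustration only; NOT measurements from the paper.
candidates = [
    ("A", 1.0, 0.60),
    ("B", 2.0, 0.75),
    ("C", 2.5, 0.70),  # slower and less accurate than B, so dominated
    ("D", 4.0, 0.80),
]
print([p[0] for p in pareto_frontier(candidates)])  # ['A', 'B', 'D']
```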

Micah Goldblum (@micahgoldblum)

Most existing approaches compress the KV cache after the model processes the entire context. That means you still pay the full prefill cost, latency remains high, and many of those methods are hard to deploy in real systems. 3/10
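
To get a feel for why sequence length matters so much here, the sketch below uses the standard size formula for the KV cache of a transformer decoder: one key and one value vector per token, per layer, per KV head. The configuration values are hypothetical and do not describe the 4B decoder from the post; `kv_cache_bytes` is an illustrative helper. In the simplest case, a method that compresses only after prefill must first materialize the cache at full context length.

```python
def kv_cache_bytes(
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes per value, e.g. fp16 or bf16
) -> int:
    """Size of an uncompressed KV cache for a single sequence."""
    # Leading factor of 2: a key vector and a value vector are both stored.
    return 2 * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical decoder configuration, chosen only to make the arithmetic concrete.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)

full = kv_cache_bytes(seq_len=128_000, **config)
short = kv_cache_bytes(seq_len=8_000, **config)  # the same context at 1:16, if latents replace tokens

print(f"full context: {full / 2**30:.3f} GiB")
print(f"16x shorter:  {short / 2**30:.3f} GiB")
```

The cache grows linearly with sequence length, so feeding the decoder a sequence 16 times shorter shrinks this term by the same factor. That the decoder cache in LCLMs scales exactly this way is an assumption of the sketch.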

Micah Goldblum (@micahgoldblum)

We believe that compression architectures can meet the growing context demands of agents and reasoning models. How to plug these models into complex agentic systems and how to compress text online while generating are open problems. 8/10

Micah Goldblum (@micahgoldblum)

We then built an agentic system where the model globally reasons over compressed context and selectively expands regions on demand. This lets an agent “skim” enormous corpora before zooming in on precise details. 7/10
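
One way to picture the "skim, then zoom in" idea is a context in which every region is held in compressed form by default and only selected regions are swapped back to raw text. The sketch below is a guess at the bookkeeping, not the authors' system: `Region`, `build_context`, and the `<latent:i>` strings are hypothetical, and the choice of which regions to expand (made by the model in the described system) is simply passed in as a set of IDs.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Region:
    region_id: int
    raw_text: str    # full-resolution text, stored outside the model's context
    compressed: str  # stand-in for this region's latent representation


def build_context(regions: List[Region], expanded_ids: Set[int]) -> List[str]:
    """Use raw text for expanded regions and the compressed form for the rest."""
    return [
        r.raw_text if r.region_id in expanded_ids else r.compressed
        for r in regions
    ]


regions = [
    Region(0, "def add(a, b): return a + b", "<latent:0>"),
    Region(1, "def div(a, b): return a / b", "<latent:1>"),
    Region(2, "def mul(a, b): return a * b", "<latent:2>"),
]

# First pass: skim everything in compressed form.
print(build_context(regions, expanded_ids=set()))
# Second pass: expand only region 1 (hypothetically chosen by the agent).
print(build_context(regions, expanded_ids={1}))
```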
