
Microsoft Research introduces Next-Latent Prediction, training transformers on internal latent states to achieve up to 3.3x faster inference

The method helps models construct compact world models.


Original post

Jayden Teoh@jayden_teoh_

Next-token prediction is myopic. What if transformers learn to predict their own next latent state?

🌠 We present 𝗡𝗲𝘅𝘁-𝗟𝗮𝘁𝗲𝗻𝘁 𝗣𝗿𝗲𝗱𝗶𝗰𝘁𝗶𝗼𝗻 (𝗡𝗲𝘅𝘁𝗟𝗮𝘁): a self-supervised learning method that teaches transformers to form compact world models for reasoning and planning. It also unlocks up to 3.3x faster inference via self-speculative decoding! 🚀

8:26 AM · Jun 16, 2026 · 30.8K Views

Sentiment

Users praised NextLat for teaching transformers to predict latent states, calling the work awesome, cool, and promising with potential step-function impact on world models.

Positive: 100.0%

Negative: 0.0%

9 comments with sentiment.


Posts from X


John Langford@JohnCLangford

Nextlat is lots of fun.

This new version shows more scale, bootstrapping state tracking, and strong speculative execution.



Jayden Teoh@jayden_teoh_

8/ This is joint work with my amazing collaborators at Microsoft Research @MSFTResearch: @manan_tomar, @KwangjunA, @edward_s_hu, @Tea_Pearce, @pratyusha_PS, Akshay Krishnamurthy, @riashatislam, @lamblabtsinghua, and @JohnCLangford

Please check out our blog, paper and code!

✍️ Blog: http://jaydenteoh.github.io/blog/2026/nextlat
💻 Code: http://github.com/JaydenTeoh/NextLat
📜 Paper: http://arxiv.org/abs/2511.05963


Vincent Herrmann@idivinci

@jayden_teoh_ Really nice work on hidden state prediction! But I have to point out that we introduced this methodology in our ICML 2025 and ICLR 2026 papers w/ @eric_alcaide, @robert_csordas and @SchmidhuberAI: https://icml.cc/virtual/2025/poster/44985 https://iclr.cc/virtual/2026/poster/10010832 Please compare :)


Jayden Teoh@jayden_teoh_

4/ Planning.

Next-token prediction is myopic.

On the Path-Star planning task, NextLat is the only method that solves all graph configurations.

Why? Belief states force the model to plan ahead, not just predict the next token.


Jayden Teoh@jayden_teoh_

7/ MTP has become a standard ingredient in open-source language model pretraining.

NextLat offers everything MTP does — and more: stronger representations, faster training and inference, better data efficiency!

It's time that frontier labs seriously consider NextLat as an alternative to MTP for pretraining! 😃


Jayden Teoh@jayden_teoh_

5/ Language Modeling.

NextLat improves language modeling without sacrificing next-token perplexity.

Moreover, representations learned by NextLat encode more predictive information about the future—up to 20 tokens ahead!


Jayden Teoh@jayden_teoh_

2/ Why Next-Latent Prediction?

Transformers can attend to the entire past, so they have little pressure to compress history into a compact state. They often learn shortcut solutions.

🛠️ NextLat fixes this with a few key benefits:

1) 𝗥𝗲𝗽𝗿𝗲𝘀𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴: NextLat encourages transformers to compress history into compact belief states.
2) 𝗕𝗲𝘁𝘁𝗲𝗿 𝗗𝗮𝘁𝗮 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆: predicting in latent space provides denser supervision than predicting one-hot tokens.
3) 𝗙𝗮𝘀𝘁𝗲𝗿 𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲: via recursive multi-step lookahead.


Jayden Teoh@jayden_teoh_

1/ How does Next-Latent Prediction work?

On top of next-token prediction, NextLat trains the transformer to predict its own next latent state given the current latent state and next token.

💡One simple change. Yet, surprisingly large downstream benefits.
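As a rough illustration of the objective described in this post, the sketch below adds a latent-prediction term to the usual next-token loss. It is a minimal sketch, not the authors' implementation: the names `nextlat_loss`, `dynamics`, and `latent_weight`, and the choice of mean squared error as the latent distance, are assumptions made for illustration. The paper and code linked in the thread define the actual objective.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def nextlat_loss(
    hidden: Sequence[Vector],           # hidden[t]: latent state after reading token t
    next_token_embs: Sequence[Vector],  # next_token_embs[t]: embedding of token t+1
    token_nll: Sequence[float],         # per-position next-token negative log-likelihood
    dynamics: Callable[[Vector, Vector], Vector],  # hypothetical latent dynamics head
    latent_weight: float = 1.0,         # hypothetical weighting between the two terms
) -> float:
    """Next-token loss plus a next-latent prediction penalty (illustrative only)."""
    # Predict hidden[t+1] from (hidden[t], token t+1) and compare with the
    # latent state the transformer actually produced at step t+1.
    latent_terms = [
        mse(dynamics(hidden[t], next_token_embs[t]), hidden[t + 1])
        for t in range(len(hidden) - 1)
    ]
    latent_loss = sum(latent_terms) / len(latent_terms)
    token_loss = sum(token_nll) / len(token_nll)
    return token_loss + latent_weight * latent_loss
```

This computes only the scalar loss value. A real training setup would also need to decide whether gradients flow through the target latent, which this sketch does not address.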


Jayden Teoh@jayden_teoh_

3/ World Modeling.

🌏 We train the models on Manhattan taxi ride sequences. NextLat learns a world model that is not only more compact, but also more consistent with the real world!


Jayden Teoh@jayden_teoh_

6/ Free Lunch: Speculative Decoding.

Because NextLat learns latent dynamics, it unlocks 𝘃𝗮𝗿𝗶𝗮𝗯𝗹𝗲-𝗹𝗲𝗻𝗴𝘁𝗵 𝘀𝗽𝗲𝗰𝘂𝗹𝗮𝘁𝗶𝘃𝗲 𝗱𝗲𝗰𝗼𝗱𝗶𝗻𝗴: we can draft a flexible number of future tokens recursively in latent space.

🚀 This accelerates inference by up to 3.3x, much faster than MTP-style drafting!
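The drafting-and-verification loop described in this post can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the NextLat code: `step`, `target_next`, `max_draft`, and `min_conf` are hypothetical names, the confidence-based stopping rule is one assumed way to get a variable draft length, and verification is greedy and sequential here for readability, whereas practical speculative decoding checks all drafted tokens in one parallel forward pass.

```python
from typing import Callable, List, Tuple, TypeVar

Latent = TypeVar("Latent")


def draft_tokens(
    latent: Latent,
    step: Callable[[Latent], Tuple[int, float, Latent]],  # -> (token, confidence, next latent)
    max_draft: int,
    min_conf: float,
) -> List[int]:
    """Roll the latent dynamics forward, drafting tokens until confidence drops."""
    drafted: List[int] = []
    for _ in range(max_draft):
        token, conf, latent = step(latent)
        if conf < min_conf:
            break  # variable-length draft: stop once the drafter is unsure
        drafted.append(token)
    return drafted


def verify_greedy(
    prefix: List[int],
    drafted: List[int],
    target_next: Callable[[List[int]], int],  # full model's greedy next token
) -> List[int]:
    """Keep the longest drafted prefix the full model agrees with, plus one model token."""
    accepted: List[int] = []
    for tok in drafted:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    # Every draft was accepted, so the verification pass yields one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted
```

Each call to `verify_greedy` returns at least one token, so the output matches plain greedy decoding with the full model; any speedup comes from how many drafted tokens are accepted per verification pass.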


Jayden Teoh@jayden_teoh_

Both are self-supervised learning methods. JEPA is more closely related to pulling related views closer in latent space. NextLat focuses more on teaching the model to compress history into belief states and learn Markovian latent dynamics. I'd say NextLat is closer to the self-predictive RL literature :)


Vincent Herrmann@idivinci

@jayden_teoh_ @eric_alcaide @robert_csordas @SchmidhuberAI We're also predicting the next latent state in time, not just reconstructing it. From what I can tell, the only difference is that we're using a Gaussian instead of a Dirac distribution, and that we only predict one step in advance, not multiple.
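One way to read the Gaussian-versus-Dirac distinction raised in this reply: a Gaussian predictor outputs a mean and a variance for the next latent state and is scored by negative log-likelihood, while a point (Dirac) predictor outputs a single vector and is scored by a distance such as squared error. The sketch below illustrates that reading only. It is not code from either paper, and the function names and the diagonal-covariance form are assumptions.

```python
import math
from typing import Sequence


def squared_error(target: Sequence[float], pred: Sequence[float]) -> float:
    """Point-prediction (Dirac-style) loss: summed squared error."""
    return sum((t - p) ** 2 for t, p in zip(target, pred))


def gaussian_nll(
    target: Sequence[float],
    mean: Sequence[float],
    log_var: Sequence[float],
) -> float:
    """Negative log-likelihood of target under a diagonal Gaussian."""
    return 0.5 * sum(
        lv + (t - m) ** 2 / math.exp(lv) + math.log(2.0 * math.pi)
        for t, m, lv in zip(target, mean, log_var)
    )
```

With the variance fixed to 1 (`log_var` all zeros), the Gaussian loss equals half the squared error plus a constant, so under that assumption the two objectives rank predictions identically; they diverge once the variance is predicted.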


Diego@didacum333

@jayden_teoh_ Noice. Isn't this close to JEPA?


Arun Sharma@arundsharma

@jayden_teoh_ Will they be able to express their belief states in natural language tokens?

I don't know of a model that can reliably explain what percentage of its weights are arts vs science vs history.


Jayden Teoh@jayden_teoh_

@arundsharma they will express these belief states in latent space, and hopefully the output language tokens will reflect that


rish (in sf)@rishabh16_

@jayden_teoh_ The children secretly yearn for LSTMs


boop@iluvrlhf

@jayden_teoh_ amazing stuff as always goat


Noah Vandal@noah_vandal

@jayden_teoh_ this is really awesome, very nice work


Jayden Teoh@jayden_teoh_

Hi Vincent, thanks for your comment. I'm familiar with your work! I liked it a lot actually. However, your hidden state prediction is more like "hidden state reconstruction". Here in NextLat, we are asking it to predict the next latent state in time, not reconstructing the same one. Would love to know your thoughts on our paper :)
