Tech · 18h ago

Akarsh Kumar and MIT's Phillip Isola introduce Supervised Memory Training to train nonlinear RNNs without backpropagation through time

AI Judge changed title after evaluation, original title: "MIT's Phillip Isola proposes Supervised Memory Training to train RNNs in parallel without backpropagation through time"

SMT enables time-parallel training by bypassing sequential unrolling.

Original post

Phillip Isola@phillip_isola

We introduce a method for training RNNs that is time-parallel and does not suffer from vanishing/exploding gradients.

Key idea is to decouple learning 1) what should be remembered (can be done without recurrence) and 2) how to update memory (can be one-step supervised by #1).

Akarsh Kumar@akarshkumar0101

We never really knew how to train nonlinear RNNs well… BPTT struggled with vanishing grads (no long-range memory) and sequential rollout (hard to parallelize).

What if instead an oracle told us the optimal memory state m_t at each step? Then the RNN could do one-step supervised learning on (m_t, x_{t+1}) → m_{t+1} labels.

We call this Supervised Memory Training (SMT): a replacement for BPTT that trains RNNs without unrolling them. SMT is time-parallelizable and solves vanishing gradients.
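
To make the one-step supervision idea concrete, here is a minimal sketch under loudly stated assumptions. The oracle is a hypothetical toy (a running sum), the update rule is a two-parameter linear map, and the names `toy_oracle`, `make_one_step_pairs`, and `train_update_rule` are illustrative only. None of this is taken from the paper; it only shows that once memory targets are available, each (m_t, x_{t+1}) → m_{t+1} pair can be fit independently, in any order, without unrolling.

```python
import random

def toy_oracle(inputs):
    """Hypothetical stand-in for the oracle: m_t is the running sum of inputs[0..t]."""
    total, memories = 0.0, []
    for x in inputs:
        total += x
        memories.append(total)
    return memories

def make_one_step_pairs(memories, inputs):
    """Turn oracle memories into ((m_t, x_{t+1}), m_{t+1}) pairs, one per timestep."""
    return [((memories[t], inputs[t + 1]), memories[t + 1])
            for t in range(len(inputs) - 1)]

def train_update_rule(pairs, steps=2000, lr=0.5, seed=0):
    """Fit m_{t+1} ~ a*m_t + b*x_{t+1} with normalized SGD on independent pairs.

    Pairs are sampled in arbitrary order: no unrolling, no gradient through time.
    """
    rng = random.Random(seed)
    a, b = 0.0, 0.0
    for _ in range(steps):
        (m, x), target = rng.choice(pairs)
        err = (a * m + b * x) - target
        scale = lr / (m * m + x * x + 1e-8)  # normalize the step by input energy
        a -= scale * err * m
        b -= scale * err * x
    return a, b
```

In this toy setting the exact update rule is m_{t+1} = m_t + x_{t+1}, so the fitted `(a, b)` should approach `(1, 1)`. Because every pair is an independent training example, the per-timestep losses could in principle be computed in one batch.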

Website: https://akarshkumar.com/smt/ arXiv: https://arxiv.org/abs/2606.06479

2:35 PM · Jun 7, 2026 · 4.5K Views

Sentiment

Positive users praise Supervised Memory Training as a fundamental improvement over BPTT for RNN optimization because it performs credit assignment across sequences in a qualitatively different way.

Positive: 100.0% · Negative: 0.0%

5 comments with sentiment.

Cluster Engagement

Posts from X

Akarsh Kumar@akarshkumar0101

SMT estimates the oracle via a time-parallel encoder trained to embed past context into a representation that a decoder can use to predict the future.

This creates memory states that remember important info and purposefully forget unimportant details, similar to biological memory.
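
One possible way to write down the two decoupled objectives described in the thread is sketched below. The symbols are our own notation and are assumptions, not the paper's: E_φ for the encoder, D_ψ for the decoder, R_θ for the RNN update, ℓ for an unspecified prediction loss, and d for an unspecified distance between memory states. The exact losses used in the paper may differ.

```latex
% Step 1 (no recurrence): learn what to remember.
m_t = E_\phi(x_{1:t}), \qquad
\mathcal{L}_{\text{memory}}(\phi, \psi) = \sum_{t} \ell\big(D_\psi(m_t),\; x_{>t}\big)

% Step 2 (one-step supervision): learn how to update memory.
\mathcal{L}_{\text{update}}(\theta) = \sum_{t} d\big(R_\theta(m_t, x_{t+1}),\; m_{t+1}\big)
```

Each term in both sums depends only on quantities that can be computed for all t at once, which is consistent with the claim that training is time-parallel.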

Jayden Teoh@jayden_teoh_

@akarshkumar0101 Awesome stuff. We also showed that you can pre-train an RNN without recurrence by using the transformer backbone to forecast latent states and training the RNN on one-step latent predictions, in Next-Latent Prediction Transformers (https://arxiv.org/abs/2511.05963).

Akarsh Kumar@akarshkumar0101

SMT is akin to off-policy behavior cloning, and is mainly for pretraining.

To stabilize RNN rollouts, we introduce an on-policy imitation algo: DAgger Memory Training (DMT), a relatively lightweight fine-tuning phase.
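
The thread does not spell out how DMT works, so the following is only a generic DAgger-style loop adapted to memory states, offered as an assumption about the general shape of such an algorithm. The learner is rolled out on its own memory, the states it actually visits are relabeled with oracle targets, and the aggregated dataset is used to refit. The names `rollout`, `aggregate`, `dagger_loop`, and the `fit` callback are hypothetical.

```python
def rollout(update, inputs, m0=0.0):
    """Return the memory the learner holds *before* each input (on-policy states)."""
    m, visited = m0, []
    for x in inputs:
        visited.append(m)
        m = update(m, x)  # sequential: uses the learner's own previous memory
    return visited

def aggregate(update, oracle, sequences, dataset):
    """One DAgger-style round: label the learner's visited states with oracle targets."""
    for inputs in sequences:
        targets = oracle(inputs)           # targets[t]: oracle memory after inputs[0..t]
        visited = rollout(update, inputs)  # visited[t]: learner memory before inputs[t]
        dataset.extend(((m, x), y) for m, x, y in zip(visited, inputs, targets))
    return dataset

def dagger_loop(fit, update, oracle, sequences, rounds=3):
    """Alternate on-policy data collection and refitting on the aggregated dataset.

    `fit` is a user-supplied function mapping a dataset to a new update rule.
    """
    dataset = []
    for _ in range(rounds):
        aggregate(update, oracle, sequences, dataset)
        update = fit(dataset)
    return update, dataset
```

Unlike the one-step pretraining setup, the rollout here is sequential, which fits the description of this phase as a lightweight fine-tuning step rather than the main training loop.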

Francois Chaubard@FrancoisChauba1

@akarshkumar0101 I like anything getting us off of BPTT, but what if the oracle doesn't exist? What if we are trying to solve a class of problems humans don't know how to solve? Then there is no trace to train on. That's what we have to solve.

Akarsh Kumar@akarshkumar0101

Long Range Memory

Encoder+decoder are Transformers and can look up any token in the past and future and associate them immediately via attention (O(1) gradient path).

This solves vanishing gradients (left).

With this, SMT can learn long-range memory and even train next-pixel prediction RNNs (right).
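
For context, the standard textbook argument for why gradients vanish or explode under BPTT can be written as follows. The notation (hidden state h_t, update f_θ, Jacobian bound γ) is ours and is not taken from the paper.

```latex
h_t = f_\theta(h_{t-1}, x_t), \qquad
\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}, \qquad
\left\lVert \frac{\partial h_T}{\partial h_t} \right\rVert \le \gamma^{\,T-t}
\quad \text{if } \left\lVert \frac{\partial h_k}{\partial h_{k-1}} \right\rVert \le \gamma \text{ for all } k .
```

When γ < 1 the bound shrinks exponentially in the distance T − t, and when the Jacobian norms exceed 1 the product can grow instead. Attention links positions t and T directly, so the gradient path between them does not lengthen with T − t, which matches the O(1) path described in the post.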

Akarsh Kumar@akarshkumar0101

Time-parallelism

SMT is fully time-parallel, making it efficient on GPUs.

SMT requires less sequential computation than BPTT to reach a given loss.

Akarsh Kumar@akarshkumar0101

In scaling laws, the y-axis is often loss. But what if it was instead compression?

In SMT, increasing training compute allows you to get to the same loss, but with a smaller memory state size.

This is a new way to spend your compute.

Akarsh Kumar@akarshkumar0101

Thanks to @phillip_isola for inspiring me to pursue this direction in depth and providing invaluable guidance!

Phillip Isola@phillip_isola

What should be remembered: a compressed representation of the past that predicts the future (predictive state).

How to update memory: predict the next predictive state.

Akarsh Kumar@akarshkumar0101

SMT+DMT are a fundamental improvement over BPTT because they perform credit assignment across a sequence in a qualitatively different way (without recurrence).

Check out the paper for many more experiments and insights.

Akarsh Kumar@akarshkumar0101

@vincesitzmann Thanks Vincent!

Akarsh Kumar@akarshkumar0101

@MinqiJiang Thanks Minqi!

secemp@secemp9

@akarshkumar0101 cc @neurallambda

Jiaqi Feng@FengLeader

@vincesitzmann For AR we use embeddings; for diffusion we use encoders/decoders. Yet for hybrid AR-diffusion models like recent world models, we know too little about what makes a good encoder.

Vincent@InsiderPresider

@danfei_xu @phillip_isola This work is actually valid, but does SMT really hold up against transformers in the long run anyway?

Leandro Morel@MorelLeand78015

@FrancoisChauba1 @akarshkumar0101 There is a mechanism for reconstruction, although how to implement it is a different matter. It is in the experiments section.

https://github.com/Lexlangel/Interaction-dynamics-core/tree/main

Suresh@_Suresh2

@RobertTLange the compression step sounds like it'd kill inference latency, even with a small transformer
