
Will Depue, who worked on OpenAI's Sora, argues that objections to Backpropagation Through Time (BPTT) are unfounded, given that trillion-parameter models are already trained successfully

Practitioners can trade depth for width to bypass bottlenecks.

Original post
will depue@willdepue

a lot of BPTT hate is dumb. we can capably supervise deep neural nets with trillions of parameters and many dozens of layers with hundreds of gated ‘expert’ network sub-sections. i have a hard time believing that there’s any fundamental limitation to optimizing large numbers of sequential steps through time, in appropriate settings. BPTT can and should work.

even if many sequential steps are a true bottleneck, great, you can usually just trade depth for width, your network won’t care. you can run the sweep yourself: models don’t mind being super tall and skinny or super wide and short, it almost doesn’t matter.

i’m bullish on extreme BPTT returning. i see no reason why we can’t BPTT chains of thought, for example. you could even just initialize with a CoT model, throw out the embedding layers, and just try finetuning for continuous latents. it should surely work for short CoTs at least.

1:45 PM · Jun 8, 2026 · 34.3K Views
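
The "run the sweep yourself" suggestion can be set up with a small helper that lists tall-skinny and wide-short configurations at roughly the same parameter count. This is only a sketch: `mlp_params`, `width_for_budget`, and `sweep` are hypothetical names, the model is assumed to be a plain stack of square dense layers, and matching parameter counts is just one way to define the sweep. It says nothing about how the resulting models actually train.

```python
import math

def mlp_params(depth: int, width: int) -> int:
    """Parameters in `depth` square dense layers of size `width` (weights + biases)."""
    return depth * (width * width + width)

def width_for_budget(depth: int, budget: int) -> int:
    """Largest width such that a `depth`-layer stack fits in `budget` parameters."""
    per_layer = budget // depth
    # Solve w^2 + w <= per_layer exactly with integer arithmetic.
    return (math.isqrt(4 * per_layer + 1) - 1) // 2

def sweep(budget: int, depths):
    """Return (depth, width, params) triples that all fit the same budget."""
    configs = []
    for depth in depths:
        width = width_for_budget(depth, budget)
        configs.append((depth, width, mlp_params(depth, width)))
    return configs

configs = sweep(1_000_000, [1, 4, 16, 64])
for depth, width, params in configs:
    print(f"depth={depth:3d}  width={width:4d}  params={params}")
```

Each row is a candidate for the same training run, so any difference in final loss can be attributed to shape rather than size.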
Sentiment

Users praised Supervised Memory Training (SMT) as a replacement for BPTT in RNN training, citing time-parallel optimization, efficient long-range sequence modeling, and strong results beyond traditional recurrence.

Pos 96.3% · Neg 3.7%
12 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 14.8K · Likes 42 · Replies 6
Vitaliy Chiley@vitaliychiley

"trillions of parameters and many dozens of layers with hundreds of gated ‘expert’ network sub-sections"

vague posting about OAI arch like this is crazy

"dozens of layers" might be most telling (granted everyone who needs to know already knows)

1d · Views 14.8K · Likes 42 · Bookmarks 24
Bookmarks 27
Jayden Teoh@jayden_teoh_

@akarshkumar0101 Awesome stuff. We also showed that you can pre-train an RNN without recurrence by using the transformer backbone to forecast latent states and training the RNN on one-step latent predictions in Next-Latent Prediction Transformers (https://arxiv.org/abs/2511.05963)

2d · Views 1.6K · Likes 32 · Bookmarks 27
Retweets 111
Akarsh Kumar@akarshkumar0101

We never really knew how to train nonlinear RNNs well… BPTT struggled with vanishing grads (no long-range memory) and sequential rollout (hard to parallelize).

What if instead an oracle told us the optimal memory state m_t at each step? Then the RNN could do one-step supervised learning on (m_t, x_{t+1}) → m_{t+1} labels.

We call this Supervised Memory Training (SMT): a replacement for BPTT that trains RNNs without unrolling them. SMT is time-parallelizable and solves vanishing gradients.

Website: https://akarshkumar.com/smt/ arXiv: https://arxiv.org/abs/2606.06479

2d · Views 167.5K · Likes 776 · Bookmarks 655
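
The one-step objective described in the post, learning (m_t, x_{t+1}) → m_{t+1} from oracle labels, can be illustrated with a toy example. Everything here is an assumption for illustration: the scalar `cell`, the `smt_loss` helper, and especially the stand-in oracle, which is just a hidden "teacher" cell rather than the encoder-based oracle the authors describe. It is not the paper's implementation.

```python
import math
import random

def cell(params, m, x):
    """Toy scalar RNN cell: m_next = tanh(a*m + b*x + c)."""
    a, b, c = params
    return math.tanh(a * m + b * x + c)

def smt_loss(params, oracle_m, xs):
    """Mean one-step error, with oracle states as both inputs and labels."""
    terms = [
        (cell(params, oracle_m[t], xs[t + 1]) - oracle_m[t + 1]) ** 2
        for t in range(len(xs) - 1)  # no term depends on another: no unrolling
    ]
    return sum(terms) / len(terms)

# Stand-in oracle (assumption): states produced by a hidden teacher cell.
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(64)]
teacher = (0.9, 0.5, 0.0)
oracle_m = [cell(teacher, 0.0, xs[0])]
for x in xs[1:]:
    oracle_m.append(cell(teacher, oracle_m[-1], x))

loss_teacher = smt_loss(teacher, oracle_m, xs)         # reproduces the oracle
loss_other = smt_loss((0.1, -0.3, 0.2), oracle_m, xs)  # an untrained student
print(loss_teacher, loss_other)
```

Because every term reads its input state from the oracle rather than from the student's previous output, the terms could be evaluated in any order or all at once, which is the sense in which the objective is time-parallel.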
will depue@willdepue

in reference to this post which has sparked a lot of BPTT talk today

Akarsh Kumar@akarshkumar0101

We never really knew how to train nonlinear RNNs well… BPTT struggled with vanishing grads (no long-range memory) and sequential rollout (hard to parallelize).

What if instead an oracle told us the optimal memory state m_t at each step? Then the RNN could do one-step supervised learning on (m_t, x_{t+1}) → m_{t+1} labels.

We call this Supervised Memory Training (SMT): a replacement for BPTT that trains RNNs without unrolling them. SMT is time-parallelizable and solves vanishing gradients.

Website: https://akarshkumar.com/smt/ arXiv: https://arxiv.org/abs/2606.06479

1d · Views 5.9K · Likes 20 · Bookmarks 13
Akarsh Kumar@akarshkumar0101

SMT is akin to off-policy behavior cloning, and is mainly for pretraining.

To stabilize RNN rollouts, we introduce an on-policy imitation algo: DAgger Memory Training (DMT), a relatively lightweight fine-tuning phase.

2d · Views 3.1K · Likes 29 · Bookmarks 8
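
The off-policy versus on-policy distinction can be sketched as a difference in which input states the training pairs contain. This follows the generic DAgger recipe (roll out the learner, relabel the visited states with the expert) and is only a guess at how it maps onto memory training. The helper names, the toy `cell`, the teacher-based oracle, and starting the rollout from `oracle_m[0]` are all assumptions, not details from the post or paper.

```python
import math

def cell(params, m, x):
    """Toy scalar RNN cell: m_next = tanh(a*m + b*x + c)."""
    a, b, c = params
    return math.tanh(a * m + b * x + c)

def off_policy_pairs(oracle_m, xs):
    """SMT-style data: inputs are the oracle's own states."""
    return [((oracle_m[t], xs[t + 1]), oracle_m[t + 1]) for t in range(len(xs) - 1)]

def on_policy_pairs(student, oracle_m, xs):
    """DAgger-style data: inputs are states the student actually visits."""
    pairs, m = [], oracle_m[0]
    for t in range(len(xs) - 1):
        pairs.append(((m, xs[t + 1]), oracle_m[t + 1]))  # label still from oracle
        m = cell(student, m, xs[t + 1])                  # follow the student's rollout
    return pairs

# Stand-in oracle (assumption): states from a hidden teacher cell.
xs = [math.sin(0.7 * t) for t in range(32)]
teacher, student = (0.9, 0.5, 0.0), (0.5, 0.5, 0.1)
oracle_m = [cell(teacher, 0.0, xs[0])]
for x in xs[1:]:
    oracle_m.append(cell(teacher, oracle_m[-1], x))

sm_data = off_policy_pairs(oracle_m, xs)
dm_data = on_policy_pairs(student, oracle_m, xs)
drift = max(abs(d[0][0] - s[0][0]) for d, s in zip(dm_data, sm_data))
print(f"largest gap between visited and oracle states: {drift:.3f}")
```

The on-policy pairs expose the student to its own drifted states, which is the usual motivation for a DAgger-style phase after behavior cloning. Note that this rollout is sequential, unlike the one-step objective.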
Akarsh Kumar@akarshkumar0101

Long Range Memory

Encoder+decoder are Transformers and can look up any token in the past and future and associate them immediately via attention (O(1) gradient path).

This solves vanishing gradients (left).

With this, SMT can learn long-range memory and even train next-pixel prediction RNNs (right).

2d · Views 3.6K · Likes 32 · Bookmarks 5
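
For contrast with an O(1) gradient path, the vanishing-gradient effect in an unrolled RNN can be shown numerically. This is a minimal sketch on a scalar recurrence h_{t+1} = tanh(w·h_t + x_t); the function name `rollout_grad` and the chosen constants are illustrative assumptions. The gradient of the last state with respect to the first is a product of per-step Jacobians, each at most |w| in magnitude, so it shrinks geometrically with sequence length when |w| < 1.

```python
import math

def rollout_grad(w, xs, h0=0.0):
    """Return d h_T / d h_0 for h_{t+1} = tanh(w * h_t + x_t), via the chain rule."""
    h, grad = h0, 1.0
    for x in xs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # local Jacobian d h_{t+1} / d h_t
    return grad

for steps in (1, 10, 50):
    g = rollout_grad(0.5, [0.1] * steps)
    print(f"T={steps:2d}  |dh_T/dh_0| = {abs(g):.3e}")
```

A one-step objective, or a direct attention link between two positions, avoids this product entirely because the path between them has constant length.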
Akarsh Kumar@akarshkumar0101

In scaling laws, the y-axis is often loss. But what if it was instead compression?

In SMT, increasing training compute allows you to get to the same loss, but with a smaller memory state size.

This is a new way to spend your compute.

2d · Views 2.3K · Likes 21 · Bookmarks 4
Akarsh Kumar@akarshkumar0101

Thanks to @phillip_isola for inspiring me to pursue this direction in depth and providing invaluable guidance!

2d · Views 1.8K · Likes 21 · Bookmarks 4
Akarsh Kumar@akarshkumar0101

SMT+DMT are a fundamental improvement over BPTT because they perform credit assignment across a sequence in a qualitatively different way (without recurrence).

Check out the paper for many more experiments and insights.

2d · Views 1.9K · Likes 20 · Bookmarks 3
will depue@willdepue

@vitaliychiley dude i took kimi 2.6's architecture and tweeted it, there's literally no oai arch information here

1d · Views 1.5K · Likes 34 · Bookmarks 2
Akarsh Kumar@akarshkumar0101

Time-parallelism

SMT is fully time-parallel, making it efficient on GPUs.

SMT outperforms BPTT in sequential computation required to achieve a certain loss.

2d · Views 2.5K · Likes 23 · Bookmarks 1
Akarsh Kumar@akarshkumar0101

SMT estimates the oracle via a time-parallel encoder trained to embed past context into a representation that a decoder can use to predict the future.

This creates memory states that remember important info and purposefully forget unimportant details, similar to biological memory.

2d · Views 4.6K · Likes 24 · Bookmarks 1
Amir Zamir@zamir_ar

RNN sans backprop through time. Besides addressing some of the core issues that make learning long-range recurrence hard, this is a natural and scalable way to learn a good representation.

1d · Views 1.3K · Likes 6 · Bookmarks 3
will depue@willdepue

bruh

1d · Views 3.4K · Likes 15 · Bookmarks 0
Francois Chaubard@FrancoisChauba1

@akarshkumar0101 I like anything getting us off of BPTT, but what if the oracle doesn't exist? what if we are trying to solve a class of problems humans don't know how to solve? then there is no trace to train on. that's what we have to solve.

2d · Views 1.2K · Likes 3 · Bookmarks 2
ueaj@_ueaj

@vitaliychiley you're telling me they're doing deep learning & MoEs at OpenAI? wow I couldn't have guessed

1d · Views 526 · Likes 15

@willdepue I'm confused, how are 1M context models not already extreme BPTT?

I think the main difficulty is not in training, but in infra to handle the memory issue.

1d · Views 1.1K · Likes 4 · Bookmarks 1

@akarshkumar0101 also works very nicely for probabilistic forecasting: https://github.com/kashif/gluon-ts/blob/d22fd44a25853c9f8d5b62fa2c061edea2607bf9/examples/smt_vs_deepar.ipynb

1d · Views 274 · Likes 4 · Bookmarks 2
Yura Kuratov@yurakuratov

@akarshkumar0101 Have you seen MemUP? https://arxiv.org/abs/2207.13649 It allows RNNs to learn long-range dependencies without BPTT.

2d · Views 267 · Likes 4 · Bookmarks 2
Pranav Shyam@recurseparadox

@willdepue This is not true. Your arch and depth claims were true in pre-2023 era not now. Depth absolutely matters. Parity problem can be solved by a random RNN but not by transformer.

There’s also no BPTT hate. It’s just slow

1d · Views 814 · Likes 5 · Bookmarks 1
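
The parity remark is easier to follow with the recurrence written out. The sketch below is a hand-written one-bit recurrent update, not a random or trained RNN, and `parity_recurrent` is a hypothetical name. It only shows why parity is a natural fit for sequential state: one bit of memory is carried through one step per input, so the computation is as deep as the sequence is long. It does not demonstrate the claim about transformers.

```python
import random

def parity_recurrent(bits):
    """Carry one bit of state through the sequence, flipping it on every 1."""
    state = 0
    for b in bits:
        state ^= b  # state_{t+1} = state_t XOR x_{t+1}
    return state

random.seed(0)
sequence = [random.randint(0, 1) for _ in range(1000)]
print(parity_recurrent(sequence), sum(sequence) % 2)
```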