a lot of BPTT hate is dumb. we can capably supervise deep neural nets with trillions of parameters and many dozens of layers with hundreds of gated ‘expert’ network sub-sections. i have a hard time believing that theres any fundamental limitation to optimize large numbers of sequential steps through time, in appropriate settings. BPTT can and should work. even if many sequential steps are a true bottleneck, great, you can usually just trade depth for width, your network won’t care. you can run the sweep yourself: models don’t mind being super tall and skinny or super wide and short, it almost doesn’t matter. i’m bullish on extreme BPTT returning. i see no reason why we can’t BPTT chains of thought, for example. you could even just initialize with a CoT model, throw out the embedding layers, and just try finetuning for continuous latents. it should surely work for short CoTs at least
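One way to make the depth-for-width trade concrete is to hold the weight count fixed and see what width each depth affords. This is only an arithmetic sketch: it counts square hidden-layer weights (roughly depth × width²), ignores biases, embeddings and attention, and says nothing about whether loss actually stays the same. The function names and the 64 × 1024 reference shape are made up for illustration.

```python
import math


def mlp_params(depth: int, width: int) -> int:
    """Weights in `depth` square hidden layers of size `width` (biases ignored)."""
    return depth * width * width


def width_for_budget(budget: int, depth: int) -> int:
    """Largest width whose `depth` square layers fit within `budget` weights."""
    return math.isqrt(budget // depth)


# Hypothetical "tall and skinny" reference shape.
budget = mlp_params(depth=64, width=1024)

# Same weight budget, fewer sequential layers: cutting depth by 4x
# buys a 2x wider layer, because the count scales with width squared.
for depth in (64, 16, 4):
    width = width_for_budget(budget, depth)
    print(f"depth={depth:3d} width={width:5d} params={mlp_params(depth, width)}")
```

The number of sequential layer applications falls linearly with depth while the width only grows with its square root, which is the trade the post is pointing at.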
Will Depue, who worked on OpenAI's Sora, argues that objections to Backpropagation Through Time (BPTT) are unfounded, given that trillion-parameter models with dozens of layers are already trained successfully.
He adds that when many sequential steps are a bottleneck, practitioners can usually trade depth for width.
Users praised Supervised Memory Training (SMT), a proposed replacement for BPTT that trains RNNs in a time-parallel way, for enabling efficient long-range sequence modeling without unrolling the recurrence.
"trillions of parameters and many dozens of layers with hundreds of gated ‘expert’ network sub-sections"
vague posting about OAI arch like this is crazy. "dozens of layers" might be most telling (granted everyone who needs to know already knows)

@akarshkumar0101 Awesome stuff. We also showed that you can pre-train an RNN without recurrence by using the transformer backbone to forecast latent states and training the RNN on one-step latent predictions in Next-Latent Prediction Transformers (https://arxiv.org/abs/2511.05963)
We never really knew how to train nonlinear RNNs well… BPTT struggled with vanishing grads (no long-range memory) and sequential rollout (hard to parallelize).
What if instead an oracle told us the optimal memory state m_t at each step? Then the RNN could do one-step supervised learning on (m_t, x_{t+1}) → m_{t+1} labels.
We call this Supervised Memory Training (SMT): a replacement for BPTT that trains RNNs without unrolling them. SMT is time-parallelizable and solves vanishing gradients.
Website: https://akarshkumar.com/smt/ arXiv: https://arxiv.org/abs/2606.06479
in reference to this post which has sparked a lot of BPTT talk today

SMT is akin to off-policy behavior cloning, and is mainly for pretraining.
To stabilize RNN rollouts, we introduce an on-policy imitation algo: DAgger Memory Training (DMT), a relatively lightweight fine-tuning phase.
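Why rollouts need stabilizing at all can be seen in a toy case. The sketch below is not DMT and not the paper's code: it only contrasts teacher-forced one-step error (every step starts from the oracle state, as in off-policy training) with free-running error (every step starts from the cell's own previous state). The scalar linear cell, the hand-made `memory` sequence standing in for oracle states, and the slightly wrong coefficient 0.95 are all assumptions for illustration.

```python
def one_step_errors(a, b, memory, inputs):
    """Teacher-forced errors: each step starts from the oracle state m_t."""
    return [
        abs(a * memory[t] + b * inputs[t] - memory[t + 1])
        for t in range(len(inputs))
    ]


def rollout_errors(a, b, memory, inputs):
    """Free-running errors: each step starts from the cell's own last state."""
    m_hat, errs = memory[0], []
    for t, x in enumerate(inputs):
        m_hat = a * m_hat + b * x
        errs.append(abs(m_hat - memory[t + 1]))
    return errs


# Stand-in oracle states from m_{t+1} = 0.9 * m_t + 0.5 * x_{t+1}.
inputs = [1.0] * 50
memory = [0.0]
for x in inputs:
    memory.append(0.9 * memory[-1] + 0.5 * x)

# A cell that is only slightly off in its recurrent weight.
a, b = 0.95, 0.5
teacher_forced = one_step_errors(a, b, memory, inputs)
free_running = rollout_errors(a, b, memory, inputs)

print("worst one-step error:", max(teacher_forced))
print("final rollout error: ", free_running[-1])
```

The one-step error stays small at every position, while the rollout error compounds because the cell visits states the oracle never produced. That distribution shift is the usual motivation for on-policy imitation methods in the DAgger family.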

Long Range Memory
Encoder+decoder are Transformers and can look up any token in the past and future and associate them immediately via attention (O(1) gradient path).
This solves vanishing gradients (left).
With this, SMT can learn long-range memory and even train next-pixel prediction RNNs (right).
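For contrast with the O(1) attention path, here is a small sketch of what the gradient path looks like under BPTT. It uses a scalar tanh recurrence chosen purely for illustration (the weight 0.9 and the constant input 0.1 are arbitrary assumptions), and computes how much the final state depends on the initial one as a product of per-step Jacobians.

```python
import math


def bptt_sensitivity(w, inputs, h0=0.0):
    """d h_T / d h_0 for h_{t+1} = tanh(w * h_t + x_t), via the chain rule."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # local Jacobian of one recurrent step
    return grad


# The sensitivity is a product of T factors, each below |w| in magnitude,
# so with |w| < 1 it shrinks geometrically in the sequence length.
for T in (1, 10, 100):
    print(T, bptt_sensitivity(0.9, [0.1] * T))
```

With |w| < 1 every factor is below 1 in magnitude, so the signal from step 0 reaching step T decays geometrically in T. A direct attention lookup has no such product, which is the sense in which the posts call its gradient path O(1).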

In scaling laws, the y-axis is often loss. But what if it was instead compression?
In SMT, increasing training compute allows you to get to the same loss, but with a smaller memory state size.
This is a new way to spend your compute.

Thanks to @phillip_isola for inspiring me to pursue this direction in depth and providing invaluable guidance!

SMT+DMT are a fundamental improvement over BPTT because they perform credit assignment across a sequence in a qualitatively different way (without recurrence).
Check out the paper for many more experiments and insights.
@vitaliychiley dude i took kimi 2.6s architecture and tweeted it, theres literally no oai arch information here

Time-parallelism
SMT is fully time-parallel, making it efficient on GPUs.
SMT outperforms BPTT in sequential computation required to achieve a certain loss.
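The time-parallel claim follows from the shape of the objective: once memory states are given, each (m_t, x_{t+1}) → m_{t+1} example is independent of the others. A minimal sketch of that one-step objective follows, under loud assumptions: the linear `cell`, the hand-made `memory` sequence standing in for the oracle (SMT estimates it with an encoder instead), the learning rate, and every name here are illustrative, not the paper's implementation.

```python
import random


def make_pairs(memory, inputs):
    """Turn states m_0..m_T and inputs x_1..x_T into one-step examples.

    Each example is ((m_t, x_{t+1}), m_{t+1}). Examples are independent,
    so the loss over them needs no unrolling through time.
    """
    return [((memory[t], inputs[t]), memory[t + 1]) for t in range(len(inputs))]


def cell(params, m, x):
    """Toy linear recurrent cell: m_next = a * m + b * x."""
    a, b = params
    return a * m + b * x


def one_step_loss_and_grad(params, pairs):
    """Mean squared one-step error and its gradient with respect to (a, b)."""
    loss, grad_a, grad_b = 0.0, 0.0, 0.0
    for (m, x), target in pairs:  # order is irrelevant: no state is carried
        err = cell(params, m, x) - target
        loss += err * err
        grad_a += 2.0 * err * m
        grad_b += 2.0 * err * x
    n = len(pairs)
    return loss / n, (grad_a / n, grad_b / n)


rng = random.Random(0)
inputs = [rng.uniform(-1.0, 1.0) for _ in range(200)]

# Stand-in "oracle" memory: a decayed running sum of the inputs.
memory = [0.0]
for x in inputs:
    memory.append(0.9 * memory[-1] + 0.5 * x)

pairs = make_pairs(memory, inputs)
params = (0.0, 0.0)
for _ in range(2000):
    loss, (grad_a, grad_b) = one_step_loss_and_grad(params, pairs)
    params = (params[0] - 0.1 * grad_a, params[1] - 0.1 * grad_b)

print("fitted (a, b):", params, "one-step loss:", loss)
```

The loop over `pairs` could be shuffled or batched across all timesteps at once, since nothing is carried from one example to the next. BPTT would instead have to run the cell step by step before any gradient is available.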

SMT estimates the oracle via a time-parallel encoder trained to embed past context into a representation that a decoder can use to predict the future.
This creates memory states that remember important info and purposefully forget unimportant details, similar to biological memory.
RNN sans backprop through time. Besides addressing some of the core issues that make learning long-range recurrence hard, this is a natural and scalable way to learn a good representation.
bruh

@akarshkumar0101 I like anything getting us off of BPTT.. but.. what if the oracle doesnt exist. what if we are trying to solve a class of problems humans dont know how to solve. then there is no trace to train on. thats what we have to solve.

@vitaliychiley you're telling me they're doing deep learning & MoEs at OpenAI? wow I couldn't have guessed
@willdepue I'm confused, how are 1M context models not already extreme BPTT?
I think the main difficulty is not in training, but in infra to handle the memory issue.

@akarshkumar0101 also works very nicely for probabilistic forecasting: https://github.com/kashif/gluon-ts/blob/d22fd44a25853c9f8d5b62fa2c061edea2607bf9/examples/smt_vs_deepar.ipynb

@akarshkumar0101 Have you seen MemUP? https://arxiv.org/abs/2207.13649 It allows RNNs to learn long-range dependencies without BPTT.
@willdepue This is not true. Your arch and depth claims were true in pre-2023 era not now. Depth absolutely matters. Parity problem can be solved by a random RNN but not by transformer.
There’s also no BPTT hate. It’s just slow
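For readers unfamiliar with the parity example: parity of a bit string needs only a one-bit state updated once per input, which is why it is a standard case for recurrence. The sketch below is a hand-written recurrence, not a random RNN, and it does not test the claim about transformers; it only shows how small the recurrent solution is.

```python
def parity_rnn(bits):
    """Running parity with a single bit of recurrent state."""
    state = 0
    for b in bits:
        state ^= b  # flip the state whenever a 1 arrives
    return state


print(parity_rnn([1, 0, 1, 1]))
```

The state size is constant in the sequence length, but the computation takes one sequential step per input bit, which is the "it's just slow" side of the same trade.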