Will Depue, who worked on OpenAI's Sora, argues that most objections to backpropagation through time (BPTT) are unfounded, given that trillion-parameter models are already trained successfully

will depue@willdepue

a lot of BPTT hate is dumb. we can capably supervise deep neural nets with trillions of parameters and many dozens of layers with hundreds of gated ‘expert’ network sub-sections. i have a hard time believing that theres any fundamental limitation to optimize large numbers of sequential steps through time, in appropriate settings. BPTT can and should work

even if many sequential steps are a true bottleneck, great, you can usually just trade depth for width, your network won’t care. you can run the sweep yourself: models don’t mind being super tall and skinny or super wide and short, it almost doesn’t matter

i’m bullish on extreme BPTT returning. i see no reason why we can’t BPTT chains of thought, for example. you could even just initialize with a CoT model, throw out the embedding layers, and just try finetuning for continuous latents. it should surely work for short CoTs at least
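
For readers who have not seen BPTT written out, here is a minimal sketch of what the post is defending: a loss on the final state of a recurrent network is backpropagated through every earlier time step. The scalar tanh cell, the parameter names `w` and `u`, and the final-step squared-error loss are assumptions chosen for brevity, not anything from the thread. The finite-difference values at the end are only a sanity check on the hand-written gradient.

```python
import math

def forward(w, u, xs, h0=0.0):
    """Unroll h_t = tanh(w*h_{t-1} + u*x_t) and return all hidden states."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def loss(w, u, xs, target):
    """Squared error on the final hidden state only."""
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2

def bptt_grads(w, u, xs, target):
    """Backpropagate the final-step loss through every time step."""
    hs = forward(w, u, xs)
    dh = hs[-1] - target                 # dL/dh_T
    dw = du = 0.0
    for t in range(len(xs), 0, -1):      # walk backwards through time
        dz = dh * (1.0 - hs[t] ** 2)     # through the tanh at step t
        dw += dz * hs[t - 1]             # w is shared across steps, so sum
        du += dz * xs[t - 1]
        dh = dz * w                      # gradient handed to h_{t-1}
    return dw, du

xs = [0.5, -1.0, 0.25, 0.8, -0.3]        # toy input sequence
w, u, target = 0.9, 0.7, 0.2
dw, du = bptt_grads(w, u, xs, target)

# Central finite differences as an independent check on the BPTT gradients.
eps = 1e-6
dw_fd = (loss(w + eps, u, xs, target) - loss(w - eps, u, xs, target)) / (2 * eps)
du_fd = (loss(w, u + eps, xs, target) - loss(w, u - eps, xs, target)) / (2 * eps)
```

The backward loop is inherently sequential: step t cannot run until step t+1 has produced `dh`. That is the "sequential rollout" cost the rest of the thread argues about.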

Jayden Teoh@jayden_teoh_

@akarshkumar0101 Awesome stuff. We also showed that you can pre-train an RNN without recurrence by using the transformer backbone to forecast latent states and training the RNN on one-step latent predictions in Next-Latent Prediction Transformers (https://arxiv.org/abs/2511.05963)

Akarsh Kumar@akarshkumar0101

We never really knew how to train nonlinear RNNs well… BPTT struggled with vanishing grads (no long-range memory) and sequential rollout (hard to parallelize).

What if instead an oracle told us the optimal memory state m_t at each step? Then the RNN could do one-step supervised learning on (m_t, x_{t+1}) → m_{t+1} labels.

We call this Supervised Memory Training (SMT): a replacement for BPTT that trains RNNs without unrolling them. SMT is time-parallelizable and solves vanishing gradients.

Website: https://akarshkumar.com/smt/ arXiv: https://arxiv.org/abs/2606.06479
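
The one-step idea in the post can be sketched on a toy problem. Assume an oracle hands us memory states m_0..m_T; the cell is then fit on independent (m_t, x_{t+1}) → m_{t+1} pairs, with no unrolling. Everything specific here is an assumption for illustration: the scalar tanh cell, the constants `TRUE_A` and `TRUE_B`, and the hand-made oracle. In the thread the oracle is estimated by an encoder, which this sketch does not attempt.

```python
import math
import random

random.seed(0)
TRUE_A, TRUE_B = 0.5, 0.8    # hypothetical oracle dynamics, toy only

def oracle_states(xs):
    """Stand-in oracle: returns m_0..m_T for inputs x_1..x_T."""
    ms = [0.0]
    for x in xs:
        ms.append(math.tanh(TRUE_A * ms[-1] + TRUE_B * x))
    return ms

xs = [random.uniform(-1, 1) for _ in range(200)]
ms = oracle_states(xs)

# One-step training pairs (m_t, x_{t+1}, m_{t+1}). No pair depends on the
# model's own output for another pair, so they could be processed in parallel.
pairs = [(ms[t], xs[t], ms[t + 1]) for t in range(len(xs))]

def one_step_loss(a, b):
    """Mean squared error of the cell's one-step predictions."""
    return sum(0.5 * (math.tanh(a * m + b * x) - y) ** 2
               for m, x, y in pairs) / len(pairs)

def one_step_grads(a, b):
    """Gradient of one_step_loss; note there is no loop through time."""
    ga = gb = 0.0
    for m, x, y in pairs:
        p = math.tanh(a * m + b * x)
        d = (p - y) * (1.0 - p * p)
        ga += d * m
        gb += d * x
    return ga / len(pairs), gb / len(pairs)

a, b = 0.0, 0.0
initial = one_step_loss(a, b)
for _ in range(2000):                     # plain gradient descent
    ga, gb = one_step_grads(a, b)
    a -= 0.5 * ga
    b -= 0.5 * gb
final = one_step_loss(a, b)
```

Because each gradient term touches a single time step, no gradient is multiplied through a chain of recurrent Jacobians, which is the sense in which this setup sidesteps vanishing gradients.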

will depue@willdepue

in reference to this post which has sparked a lot of BPTT talk today

Akarsh Kumar@akarshkumar0101

SMT is akin to off-policy behavior cloning, and is mainly for pretraining.

To stabilize RNN rollouts, we introduce an on-policy imitation algo: DAgger Memory Training (DMT), a relatively lightweight fine-tuning phase.
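
The thread does not spell out how DMT works, so the following is only the generic DAgger pattern applied to a toy memory cell: roll out the student on its own states, relabel the visited states with oracle targets, aggregate, and refit. The scalar cell, the hand-made oracle, the choice of the oracle's next state as the label, and all constants are assumptions for illustration.

```python
import math
import random

random.seed(0)
TRUE_A, TRUE_B = 0.5, 0.8             # hypothetical oracle dynamics (toy)

def step(a, b, m, x):
    return math.tanh(a * m + b * x)

xs = [random.uniform(-1, 1) for _ in range(200)]
oracle = [0.0]
for x in xs:
    oracle.append(step(TRUE_A, TRUE_B, oracle[-1], x))

def rollout(a, b):
    """Run the student on its own memory states (on-policy)."""
    ms = [0.0]
    for x in xs:
        ms.append(step(a, b, ms[-1], x))
    return ms

def rollout_error(a, b):
    """Mean squared gap between student and oracle memory trajectories."""
    ms = rollout(a, b)
    return sum((s - o) ** 2 for s, o in zip(ms, oracle)) / len(ms)

def fit(a, b, data, steps=300, lr=0.5):
    """One-step regression on the aggregated dataset (no unrolling)."""
    for _ in range(steps):
        ga = gb = 0.0
        for m, x, y in data:
            p = step(a, b, m, x)
            d = (p - y) * (1.0 - p * p)
            ga += d * m
            gb += d * x
        a -= lr * ga / len(data)
        b -= lr * gb / len(data)
    return a, b

a, b = 0.0, 0.0                       # untrained student
error_before = rollout_error(a, b)
dataset = []
for _ in range(5):                    # DAgger iterations
    visited = rollout(a, b)           # states the student actually reaches
    # Relabel: from the student's own state, target the oracle's next state.
    dataset += [(visited[t], xs[t], oracle[t + 1]) for t in range(len(xs))]
    a, b = fit(a, b, dataset)
error_after = rollout_error(a, b)
```

The difference from the off-policy setup is the input distribution: training states come from the student's own rollouts rather than from the oracle trajectory, so the student sees the kinds of states it will actually encounter at inference time.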

Akarsh Kumar@akarshkumar0101

Long Range Memory

Encoder+decoder are Transformers and can look up any token in the past and future and associate them immediately via attention (O(1) gradient path).

This solves vanishing gradients (left).

With this, SMT can learn long-range memory and even train next-pixel prediction RNNs (right).
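
As a point of comparison for the O(1) gradient path, here is a small sketch of why the recurrent path is a problem in the first place. In a scalar tanh RNN, the sensitivity of the final state to the initial state is a product of one Jacobian factor per step, so it shrinks geometrically when each factor is below 1. The cell and the constants are assumptions chosen for illustration.

```python
import math

def recurrent_gradient(w, u, xs, h0=0.0):
    """Return |d h_T / d h_0| for h_t = tanh(w*h_{t-1} + u*x_t)."""
    h, grad = h0, 1.0
    for x in xs:
        h = math.tanh(w * h + u * x)
        grad *= w * (1.0 - h * h)     # chain rule: one factor per time step
    return abs(grad)

w, u = 0.9, 0.5
near = recurrent_gradient(w, u, [0.3] * 10)    # 10 steps back
far = recurrent_gradient(w, u, [0.3] * 200)    # 200 steps back
```

With |w| = 0.9, every factor is at most 0.9 in magnitude, so the 200-step gradient is bounded by 0.9^200, which is below 1e-9. A direct attention link between two positions does not pass through this product.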

Akarsh Kumar@akarshkumar0101

In scaling laws, the y-axis is often loss. But what if it was instead compression?

In SMT, increasing training compute allows you to get to the same loss, but with a smaller memory state size.

This is a new way to spend your compute.

Akarsh Kumar@akarshkumar0101

Thanks to @phillip_isola for inspiring me to pursue this direction in depth and providing invaluable guidance!

Akarsh Kumar@akarshkumar0101

SMT+DMT are a fundamental improvement over BPTT because they perform credit assignment across a sequence in a qualitatively different way (without recurrence).

Check out the paper for many more experiments and insights.

will depue@willdepue

@vitaliychiley dude i took kimi 2.6s architecture and tweeted it, theres literally no oai arch information here

Vitaliy Chiley@vitaliychiley

"trillions of parameters and many dozens of layers with hundreds of gated ‘expert’ network sub-sections"

vague posting about OAI arch like this is crazy

"dozens of layers" might be most telling (granted everyone who needs to know already knows)

Akarsh Kumar@akarshkumar0101

Time-parallelism

SMT is fully time-parallel, making it efficient on GPUs.

SMT outperforms BPTT in sequential computation required to achieve a certain loss.

Akarsh Kumar@akarshkumar0101

SMT estimates the oracle via a time-parallel encoder trained to embed past context into a representation that a decoder can use to predict the future.

This creates memory states that remember important info and purposefully forget unimportant details, similar to biological memory.

Amir Zamir@zamir_ar

RNN sans backprop through time. Besides addressing some of the core issues that make learning long-range recurrence hard, this is a natural and scalable way to learn a good representation.

will depue@willdepue

bruh

Francois Chaubard@FrancoisChauba1

@akarshkumar0101 I like anything getting us off of BPTT.. but.. what if the oracle doesnt exist. what if we are trying to solve a class of problems humans dont know how to solve. then there is no trace to train on. thats what we have to solve.

ueaj@_ueaj

@vitaliychiley you're telling me they're doing deep learning & MoEs at OpenAI? wow I couldn't have guessed

Lucas Beyer (bl16)@giffmana

@willdepue I'm confused, how are 1M context models not already extreme BPTT?

I think the main difficulty is not in training, but in infra to handle the memory issue.

Kashif Rasul@krasul

@akarshkumar0101 also works very nicely for probabilistic forecasting: https://github.com/kashif/gluon-ts/blob/d22fd44a25853c9f8d5b62fa2c061edea2607bf9/examples/smt_vs_deepar.ipynb

Yura Kuratov@yurakuratov

@akarshkumar0101 Have you seen MemUP? https://arxiv.org/abs/2207.13649 It allows RNNs to learn long-range dependencies without BPTT.

Pranav Shyam@recurseparadox

@willdepue This is not true. Your arch and depth claims were true in pre-2023 era not now. Depth absolutely matters. Parity problem can be solved by a random RNN but not by transformer.

There’s also no BPTT hate. It’s just slow
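
To make the parity example concrete: parity is a natural fit for recurrence because a single bit of state, updated once per input, is enough. The sketch below shows only that hand-written recurrence. It says nothing about what a random RNN or a transformer can or cannot learn, which is the reply's claim and is not tested here.

```python
def parity_rnn(bits):
    """One-bit recurrent state: h_t = h_{t-1} XOR x_t."""
    h = 0
    for x in bits:
        h ^= x          # constant-size state, one sequential step per input
    return h

example = parity_rnn([1, 0, 1, 1])   # three ones, so parity is odd
```

The state never grows with sequence length, but the computation needs one sequential step per input bit, which is the depth-versus-width trade-off the two posts disagree about.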
