Will Depue, who worked on OpenAI's Sora, argues that backpropagation through time (BPTT) faces no fundamental limitation in optimizing many sequential steps

will depue@willdepue

a lot of BPTT hate is dumb. we can capably supervise deep neural nets with trillions of parameters and many dozens of layers with hundreds of gated ‘expert’ network sub-sections. i have a hard time believing that there’s any fundamental limitation to optimize large numbers of sequential steps through time, in appropriate settings. BPTT can and should work

even if many sequential steps are a true bottleneck, great, you can usually just trade depth for width, your network won’t care. you can run the sweep yourself: models don’t mind being super tall and skinny or super wide and short, it almost doesn’t matter

i’m bullish on extreme BPTT returning. i see no reason why we can’t BPTT chains of thought, for example. you could even just initialize with a CoT model, throw out the embedding layers, and just try finetuning for continuous latents. it should surely work for short CoTs at least
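
For readers who want the mechanics behind the acronym, here is a minimal sketch of BPTT on a scalar RNN, written by hand so the backward pass through time is visible. The cell `h_t = tanh(w * h_{t-1} + u * x_t)`, the squared-error loss on the final state, and every function name are assumptions made for illustration; none of this comes from the thread.

```python
import math

def forward(w, u, xs, h0=0.0):
    """Unroll h_t = tanh(w * h_{t-1} + u * x_t) and return all hidden states."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def loss(w, u, xs, y):
    """Squared error between the final hidden state and the target y."""
    return 0.5 * (forward(w, u, xs)[-1] - y) ** 2

def bptt_grad_w(w, u, xs, y):
    """Gradient of the loss w.r.t. w, accumulated by walking backward in time."""
    hs = forward(w, u, xs)
    dh = hs[-1] - y                      # dL/dh_T
    grad_w = 0.0
    for t in range(len(xs), 0, -1):
        da = dh * (1.0 - hs[t] ** 2)     # back through tanh at step t
        grad_w += da * hs[t - 1]         # w is shared across every step
        dh = da * w                      # pass the gradient to h_{t-1}
    return grad_w
```

Each backward step multiplies `dh` by `w * (1 - h_t ** 2)`, which is the repeated factor behind the vanishing-gradient complaint, and the loop is inherently sequential, which is the rollout complaint. A finite-difference check on `loss` is a quick way to confirm the gradient.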

will depue@willdepue

in reference to this post which has sparked a lot of BPTT talk today

Akarsh Kumar@akarshkumar0101

We never really knew how to train nonlinear RNNs well… BPTT struggled with vanishing grads (no long-range memory) and sequential rollout (hard to parallelize).

What if instead an oracle told us the optimal memory state m_t at each step? Then the RNN could do one-step supervised learning on (m_t, x_{t+1}) → m_{t+1} labels.

We call this Supervised Memory Training (SMT): a replacement for BPTT that trains RNNs without unrolling them. SMT is time-parallelizable and solves vanishing gradients.

Website: https://akarshkumar.com/smt/ arXiv: https://arxiv.org/abs/2606.06479
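
The one-step idea in the post can be sketched as follows. This is not the authors' implementation: the post does not say how the oracle memory states are obtained, so the sketch fakes them with a hypothetical teacher cell, and the scalar cell, learning rate, and all names are assumptions. It only shows that, once targets `m_t` exist, training reduces to independent `(m_t, x_{t+1}) → m_{t+1}` regression terms with no unrolling.

```python
import math

def cell(w, u, m, x):
    """One recurrent step: next memory state from current memory and input."""
    return math.tanh(w * m + u * x)

def one_step_pairs(ms, xs):
    """Build ((m_t, x_{t+1}), m_{t+1}) examples; ms has one more entry than xs."""
    return [((ms[t], xs[t]), ms[t + 1]) for t in range(len(xs))]

def loss_and_grads(w, u, pairs):
    """Mean squared one-step error and its gradient. Each term depends on a
    single pair, so the terms could be evaluated in parallel over time."""
    n = len(pairs)
    loss = gw = gu = 0.0
    for (m, x), target in pairs:
        pred = cell(w, u, m, x)
        err = pred - target
        da = err * (1.0 - pred ** 2)     # back through tanh, one step only
        loss += 0.5 * err ** 2 / n
        gw += da * m / n
        gu += da * x / n
    return loss, gw, gu

# Stand-in "oracle": memory states produced by a hypothetical teacher cell.
xs = [math.sin(0.7 * t) for t in range(1, 41)]
ms = [0.0]
for x in xs:
    ms.append(cell(0.6, 0.9, ms[-1], x))

pairs = one_step_pairs(ms, xs)
w, u = 0.0, 0.0
initial_loss, _, _ = loss_and_grads(w, u, pairs)
for _ in range(500):
    _, gw, gu = loss_and_grads(w, u, pairs)
    w -= 0.1 * gw
    u -= 0.1 * gu
final_loss, _, _ = loss_and_grads(w, u, pairs)
```

No gradient ever flows from step `t + 1` back into step `t`, so there is no product of per-step factors to vanish. The hard part, which this sketch skips entirely, is producing good memory targets in the first place.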

will depue@willdepue

@vitaliychiley dude i took kimi 2.6's architecture and tweeted it, there's literally no oai arch information here

Vitaliy Chiley@vitaliychiley

"trillions of parameters and many dozens of layers with hundreds of gated ‘expert’ network sub-sections"

vague posting about OAI arch like this is crazy

"dozens of layers" might be most telling (granted everyone who needs to know already knows)

bayes@bayeslord

@willdepue it does work!

Pranav Shyam@recurseparadox

@willdepue This is not true. Your arch and depth claims were true in the pre-2023 era, not now. Depth absolutely matters. The parity problem can be solved by a random RNN but not by a transformer.

There’s also no BPTT hate. It’s just slow
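
For context on the parity example, here is a minimal sketch of why a recurrent state suits the task: one bit of state, updated once per input, is enough. This is a hand-written recurrence, not a random or trained RNN, and it makes no claim about what transformers can or cannot learn; the function name is an assumption.

```python
def parity_recurrent(bits):
    """Running parity via a one-bit recurrent state: s_t = s_{t-1} XOR x_t."""
    state = 0
    for b in bits:
        state ^= b   # the whole "memory" is a single bit carried through time
    return state
```

The update is the same at every step, so the sequence length can grow without adding parameters, only sequential steps.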

bilal@bilaltwovec

@vitaliychiley you can't call it deep learning if you don't have at least 152 layers

will depue@willdepue

@recurseparadox you're saying aspect ratio matters for efficiency, right? i'm just saying, as i assume you'd agree, that depth & width are surprisingly fungible. and this gives some potential avenue to reducing difficulty of BPTT
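
The depth-for-width trade can be made concrete with a rough parameter budget. The sketch below assumes a standard transformer block costs about `12 * d**2` parameters per layer (attention projections plus an MLP with 4x expansion) and ignores embeddings, biases, and norms; the coefficient and the function names are assumptions, not anything stated in the thread.

```python
import math

def width_for_budget(budget, depth, coeff=12):
    """Largest width d such that coeff * depth * d**2 <= budget."""
    return math.isqrt(budget // (coeff * depth))

def sweep(budget, depths, coeff=12):
    """(depth, width) pairs that all fit the same parameter budget."""
    return [(depth, width_for_budget(budget, depth, coeff)) for depth in depths]

# Shapes for a budget equal to 8 layers of width 1024.
configs = sweep(12 * 8 * 1024 ** 2, [2, 8, 32])
```

Under this approximation, cutting depth by 4x buys 2x width at equal parameter count. Whether the resulting models train equally well is the empirical question the two posters disagree on; the sketch only enumerates the shapes a sweep would compare.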

will depue@willdepue

@bayeslord i mean more in the extreme case, if scaled

Strata@ChainZenit

@willdepue this is a solid take, the scaling potential is actually wild.

Alex YGift@Radipdegen

@willdepue "a hard time believing" is doing a lot of work there tbh

how's that supervision going in prod?
