a lot of BPTT hate is dumb. we can capably supervise deep neural nets with trillions of parameters and many dozens of layers with hundreds of gated ‘expert’ network sub-sections. i have a hard time believing that there's any fundamental limitation to optimizing large numbers of sequential steps through time, in appropriate settings. BPTT can and should work. even if many sequential steps are a true bottleneck, great, you can usually just trade depth for width, your network won’t care. you can run the sweep yourself: models don’t mind being super tall and skinny or super wide and short, it almost doesn’t matter. i’m bullish on extreme BPTT returning. i see no reason why we can’t BPTT chains of thought, for example. you could even just initialize with a CoT model, throw out the embedding layers, and just try finetuning for continuous latents. it should surely work for short CoTs at least
Will Depue, who worked on OpenAI's Sora, argues that objections to Backpropagation Through Time (BPTT) are unfounded, given that trillion-parameter networks with dozens of layers are already trained successfully.
Practitioners can trade depth for width to bypass bottlenecks.
Supportive replies praise the defense of BPTT for scaling deep nets and long sequences, while a skeptical reply questions whether that kind of supervision is feasible in production.
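To make the "run the sweep yourself" suggestion concrete, here is a minimal sketch of how one might pick tall-skinny and wide-short configurations at a roughly fixed parameter budget. It assumes a dense transformer with a 4x MLP, for which non-embedding parameters are commonly approximated as 12 · layers · width²; the function names, the 24-layer/2048-wide reference point, and the rounding to multiples of 64 are all illustrative choices, not anything from the post. It only lines up the configurations to compare; it says nothing about how their losses turn out, and MoE models would need a different count.

```python
def non_embedding_params(n_layers: int, d_model: int) -> int:
    """Rule-of-thumb count for a dense transformer: ~12 * d_model^2 per layer."""
    return 12 * n_layers * d_model ** 2


def width_for_budget(budget: int, n_layers: int, multiple: int = 64) -> int:
    """Width that roughly matches `budget` at the given depth, rounded to `multiple`."""
    d = (budget / (12 * n_layers)) ** 0.5
    return max(multiple, round(d / multiple) * multiple)


# Hypothetical reference model: 24 layers, width 2048.
budget = non_embedding_params(24, 2048)

# Depths to sweep, from wide-and-short to tall-and-skinny.
sweep = [(n_layers, width_for_budget(budget, n_layers)) for n_layers in (6, 12, 24, 48, 96)]

for n_layers, d_model in sweep:
    ratio = non_embedding_params(n_layers, d_model) / budget
    print(f"layers={n_layers:3d} width={d_model:5d} params/budget={ratio:.3f}")
```

Each row is a candidate to train under identical data and compute; the claim in the post is that the resulting losses would be close across rows.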
"trillions of parameters and many dozens of layers with hundreds of gated ‘expert’ network sub-sections"
vague posting about OAI arch like this is crazy "dozens of layers" might be most telling (granted everyone who needs to know already knows)
in reference to this post which has sparked a lot of BPTT talk today
We never really knew how to train nonlinear RNNs well… BPTT struggled with vanishing grads (no long-range memory) and sequential rollout (hard to parallelize).
What if instead an oracle told us the optimal memory state m_t at each step? Then the RNN could do one-step supervised learning on (m_t, x_{t+1}) → m_{t+1} labels.
We call this Supervised Memory Training (SMT): a replacement for BPTT that trains RNNs without unrolling them. SMT is time-parallelizable and solves vanishing gradients.
Website: https://akarshkumar.com/smt/ arXiv: https://arxiv.org/abs/2606.06479
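As a rough illustration of the one-step idea in the quoted post (not the SMT method itself, whose way of obtaining memory targets is not described here), the sketch below assumes the oracle states simply come from a fixed "teacher" tanh RNN. Given those states, every (m_t, x_{t+1}) → m_{t+1} pair becomes an independent regression example, so the student is fit on all time steps at once with no unrolling and no gradient flowing through time. The teacher, the dimensions, and the least-squares fit on pre-activations are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_m, d_x = 200, 4, 3

# Hypothetical oracle: a fixed teacher RNN whose states stand in for the
# "optimal" memory states m_t. This is an assumption for illustration only.
W_true = rng.normal(scale=0.5, size=(d_m, d_m))
U_true = rng.normal(scale=0.5, size=(d_m, d_x))

x = rng.normal(size=(T + 1, d_x))
m = np.zeros((T + 1, d_m))
for t in range(T):  # sequential rollout is used only to produce the labels
    m[t + 1] = np.tanh(W_true @ m[t] + U_true @ x[t + 1])

# One-step supervised dataset: inputs (m_t, x_{t+1}), targets m_{t+1}.
# All T time steps form a single batch, so the fit is parallel over time.
inputs = np.concatenate([m[:-1], x[1:]], axis=1)  # shape (T, d_m + d_x)
targets = np.arctanh(m[1:])                       # invert tanh to get pre-activations

# Fit the student's one-step map with ordinary least squares.
theta, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
W_fit, U_fit = theta[:d_m].T, theta[d_m:].T

print("max |W_fit - W_true|:", np.abs(W_fit - W_true).max())
print("max |U_fit - U_true|:", np.abs(U_fit - U_true).max())
```

The hard part, which this toy skips entirely, is where the memory labels come from when no teacher exists; BPTT avoids needing them at the cost of the sequential backward pass.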
@vitaliychiley dude i took kimi 2.6s architecture and tweeted it, theres literally no oai arch information here
bruh
@willdepue This is not true. Your arch and depth claims were true in pre-2023 era not now. Depth absolutely matters. Parity problem can be solved by a random RNN but not by transformer.
There’s also no BPTT hate. It’s just slow

@vitaliychiley you're telling me they're doing deep learning & MoEs at OpenAI? wow I couldn't have guessed
@willdepue it does work!

@willdepue @vitaliychiley *taking notes* twenty-three.. or fewer...
@vitaliychiley you can't call it deep learning if you dont have at least 152 layers

@vitaliychiley 😆
@bayeslord i mean more in the extreme case, if scaled

@_ueaj so much alpha
@recurseparadox you're saying aspect ratio matters for efficiency, right? i'm just saying, as i assume you'd agree, that depth & width are surprisingly fungible. and this gives some potential avenue to reducing difficulty of BPTT

@vitaliychiley haha yeah i think he doesn't expect that he's leaking anything

@sasuke___420 @vitaliychiley shhhh…. it’s an irrational number… sqrt(41)

@willdepue this is a solid take, the scaling potential is actually wild.

@_ueaj @vitaliychiley With trillions of parameters at that, according to my estimates!

@willdepue "a hard time believing" is doing a lot of work there tbh
hows that supervision going in prod?

@vitaliychiley every oai vague-post adds a trillion parameters to the rumor