Will Depue, who worked on OpenAI's Sora, argues that objections to backpropagation through time (BPTT) are unfounded, given that models with trillions of parameters are already optimized successfully

will depue@willdepue

a lot of BPTT hate is dumb. we can capably supervise deep neural nets with trillions of parameters and many dozens of layers with hundreds of gated ‘expert’ network sub-sections. i have a hard time believing that theres any fundamental limitation to optimize large numbers of sequential steps through time, in appropriate settings. BPTT can and should work

even if many sequential steps are a true bottleneck, great, you can usually just trade depth for width, your network won’t care. you can run the sweep yourself: models don’t mind being super tall and skinny or super wide and short, it almost doesn’t matter

i’m bullish on extreme BPTT returning. i see no reason why we can’t BPTT chains of thought, for example. you could even just initialize with a CoT model, throw out the embedding layers, and just try finetuning for continuous latents. it should surely work for short CoTs at least
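For readers who want to set up the "tall and skinny vs. wide and short" sweep, here is a rough sketch of how equal-parameter configurations can be picked. It assumes the common rule of thumb of about 12 * d_model^2 non-embedding parameters per standard transformer block; the function name and the three configurations are hypothetical and do not describe any specific model. It only shows that the parameter budget can be held fixed while the number of sequential layers changes. Whether loss stays flat across such a sweep is the empirical claim in the post, which this does not test.

```python
def transformer_params(n_layers: int, d_model: int) -> int:
    """Approximate non-embedding parameter count of a transformer stack.

    Rule of thumb: each block has about 4 * d^2 attention weights and
    8 * d^2 weights in a 4x-wide MLP, so 12 * d^2 per layer. This is an
    approximation, not any particular model's architecture.
    """
    return 12 * n_layers * d_model ** 2


# Hypothetical sweep: cutting depth by 4x while doubling width keeps the
# budget fixed, because parameters scale with n_layers * d_model^2.
configs = [(96, 2048), (24, 4096), (6, 8192)]
budgets = [transformer_params(n, d) for n, d in configs]

for (n, d), p in zip(configs, budgets):
    print(f"layers={n:3d} d_model={d:5d} params={p:,}")
```

All three rows print the same parameter count, while the number of sequential layer applications drops from 96 to 6.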

will depue@willdepue

in reference to this post which has sparked a lot of BPTT talk today

Akarsh Kumar@akarshkumar0101

We never really knew how to train nonlinear RNNs well… BPTT struggled with vanishing grads (no long-range memory) and sequential rollout (hard to parallelize).

What if instead an oracle told us the optimal memory state m_t at each step? Then the RNN could do one-step supervised learning on (m_t, x_{t+1}) → m_{t+1} labels.

We call this Supervised Memory Training (SMT): a replacement for BPTT that trains RNNs without unrolling them. SMT is time-parallelizable and solves vanishing gradients.

Website: https://akarshkumar.com/smt/ arXiv: https://arxiv.org/abs/2606.06479
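To make the one-step idea in the post above concrete, here is a toy sketch. It is not the SMT algorithm from the linked paper: the "oracle" here is simply a known scalar teacher recurrence (an assumption made for illustration), and all names and constants are hypothetical. What it shows is the training shape the post describes: given memory labels m_t, every (m_t, x_{t+1}) → m_{t+1} pair is an independent supervised example, so no gradient flows through time and the pairs could be processed in parallel.

```python
import math
import random

A_TRUE, B_TRUE = 0.5, -0.8  # hypothetical teacher recurrence standing in for the oracle


def make_oracle_sequence(T, seed=0):
    """Roll out the teacher once to obtain 'oracle' memory states m_0..m_T."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(T + 1)]  # xs[0] is unused
    ms = [0.0]
    for t in range(T):
        ms.append(math.tanh(A_TRUE * ms[t] + B_TRUE * xs[t + 1]))
    return xs, ms


def one_step_pairs(xs, ms):
    """Build independent examples (m_t, x_{t+1}, m_{t+1}); no unrolling needed."""
    return [(ms[t], xs[t + 1], ms[t + 1]) for t in range(len(ms) - 1)]


def loss_and_grads(a, b, pairs):
    """Mean squared one-step error and its gradient w.r.t. the student weights."""
    loss = ga = gb = 0.0
    for m, x, target in pairs:
        pred = math.tanh(a * m + b * x)
        err = pred - target
        d = 2.0 * err * (1.0 - pred * pred)  # chain rule through tanh only
        loss += err * err
        ga += d * m
        gb += d * x
    n = len(pairs)
    return loss / n, ga / n, gb / n


def train(pairs, lr=0.3, steps=3000):
    """Plain gradient descent on the one-step loss, starting from zero weights."""
    a = b = 0.0
    for _ in range(steps):
        _, ga, gb = loss_and_grads(a, b, pairs)
        a -= lr * ga
        b -= lr * gb
    return a, b


xs, ms = make_oracle_sequence(T=200)
pairs = one_step_pairs(xs, ms)
initial_loss, _, _ = loss_and_grads(0.0, 0.0, pairs)
a_hat, b_hat = train(pairs)
final_loss, _, _ = loss_and_grads(a_hat, b_hat, pairs)
print(f"a_hat={a_hat:.3f} b_hat={b_hat:.3f} loss {initial_loss:.4f} -> {final_loss:.2e}")
```

The hard part the post is about, obtaining good memory labels without a teacher, is assumed away here.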

will depue@willdepue

@vitaliychiley dude i took kimi 2.6s architecture and tweeted it, theres literally no oai arch information here

Vitaliy Chiley@vitaliychiley

"trillions of parameters and many dozens of layers with hundreds of gated ‘expert’ network sub-sections"

vague posting about OAI arch like this is crazy "dozens of layers" might be most telling (granted everyone who needs to know already knows)

will depue@willdepue

bruh

Pranav Shyam@recurseparadox

@willdepue This is not true. Your arch and depth claims were true in pre-2023 era not now. Depth absolutely matters. Parity problem can be solved by a random RNN but not by transformer.

There’s also no BPTT hate. It’s just slow
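As a small illustration of why parity suits a recurrent state, here is a sketch of a one-unit recurrence built from threshold gates that tracks parity with one sequential step per input bit. The weights are hand-set for clarity, not random, so this is not the "random RNN" construction the reply refers to, and it says nothing either way about what a transformer can or cannot do.

```python
from itertools import product


def step(z: float) -> int:
    """Heaviside threshold unit."""
    return 1 if z > 0 else 0


def parity_rnn(bits):
    """Carry a single bit of state h; each update computes h XOR x.

    XOR is built from two threshold units on s = h + x:
    step(s - 0.5) fires when s >= 1, step(s - 1.5) fires when s == 2,
    so their difference is 1 exactly when s == 1.
    """
    h = 0
    for x in bits:
        s = h + x
        h = step(s - 0.5) - step(s - 1.5)
    return h


# Check against the direct definition on every 6-bit input.
all_match = all(parity_rnn(b) == sum(b) % 2 for b in product([0, 1], repeat=6))
print(all_match)
```

The state is constant-size, but the computation takes as many sequential steps as there are bits, which is the speed cost raised in the reply above.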

ueaj@_ueaj

@vitaliychiley you're telling me they're doing deep learning & MoEs at OpenAI? wow I couldn't have guessed

bayes@bayeslord

@willdepue it does work!

sasuke⚡420@sasuke___420

@willdepue @vitaliychiley *taking notes* twenty-three.. or fewer...

bilal@bilaltwovec

@vitaliychiley you can't call it deep learning if you dont have at least 152 layers

Asher Spector@amspector100

@vitaliychiley 😆

will depue@willdepue

@bayeslord i mean more in the extreme case, if scaled

Vitaliy Chiley@vitaliychiley

@_ueaj so much alpha

will depue@willdepue

@recurseparadox you're saying aspect ratio matters for efficiency, right? i'm just saying, as i assume you'd agree, that depth & width are surprisingly fungible. and this gives some potential avenue to reducing difficulty of BPTT

sasuke⚡420@sasuke___420

@vitaliychiley haha yeah i think he doesn't expect that he's leaking anything

will depue@willdepue

@sasuke___420 @vitaliychiley shhhh…. it’s an irrational number… sqrt(41)

Strata@ChainZenit

@willdepue this is a solid take, the scaling potential is actually wild.

Charles the Fool@charlesthefool

@_ueaj @vitaliychiley With trillions of parameters at that, according to my estimates!

Alex YGift@Radipdegen

@willdepue "a hard time believing" is doing a lot of work there tbh

hows that supervision going in prod?

Zero Void@0x00_void

@vitaliychiley every oai vague-post adds a trillion parameters to the rumor
