Will Depue, who worked on OpenAI's Sora, argues that objections to backpropagation through time (BPTT) are unfounded, given that trillion-parameter models are already trained successfully

He adds that when many sequential steps become a bottleneck, practitioners can usually trade depth for width.

Original post
will depue@willdepue · #249 in AI

a lot of BPTT hate is dumb. we can capably supervise deep neural nets with trillions of parameters and many dozens of layers with hundreds of gated ‘expert’ network sub-sections. i have a hard time believing that theres any fundamental limitation to optimize large numbers of sequential steps through time, in appropriate settings. BPTT can and should work

even if many sequential steps are a true bottleneck, great, you can usually just trade depth for width, your network won’t care. you can run the sweep yourself: models don’t mind being super tall and skinny or super wide and short, it almost doesn’t matter

i’m bullish on extreme BPTT returning. i see no reason why we can’t BPTT chains of thought, for example. you could even just initialize with a CoT model, throw out the embedding layers, and just try finetuning for continuous latents. it should surely work for short CoTs at least
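As a rough illustration of the "tall and skinny" versus "wide and short" trade mentioned in the post, the sketch below does only the parameter bookkeeping for a stack of square weight matrices. The helper names `stack_params` and `width_for_budget` and the reference shape (64 layers, width 1024) are assumptions made for this example, not figures from the post, and the code says nothing about whether training loss is actually insensitive to the shape.

```python
import math

def stack_params(depth, width):
    """Weights in `depth` square (width x width) layers, biases ignored."""
    return depth * width * width

def width_for_budget(depth, budget):
    """Largest width whose stack of `depth` square layers fits in `budget` weights."""
    return math.isqrt(budget // depth)

# Hypothetical reference shape used only to fix a weight budget.
budget = stack_params(depth=64, width=1024)

# Same budget, three aspect ratios: 4x shallower, reference, 4x deeper.
shapes = {depth: width_for_budget(depth, budget) for depth in (16, 64, 256)}
for depth, width in shapes.items():
    print(f"depth={depth:4d} width={width:5d} params={stack_params(depth, width)}")
```

Because the weight count grows with depth times width squared, cutting depth by 4x at a fixed budget only doubles the width.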

1:45 PM · Jun 8, 2026 · 29.1K Views
Sentiment

Positive users praise the defense of BPTT for scaling deep nets and long sequences due to its strong potential and proven effectiveness, while the negative reply questions supervision feasibility in production.

Pos 66.7%
Neg 33.3%
3 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 12.8K · Bookmarks 20 · Likes 35 · Replies 6
Vitaliy Chiley@vitaliychiley

"trillions of parameters and many dozens of layers with hundreds of gated ‘expert’ network sub-sections"

vague posting about OAI arch like this is crazy "dozens of layers" might be most telling (granted everyone who needs to know already knows)

7h · Views 12.8K · Likes 35 · Bookmarks 20
will depue@willdepue

in reference to this post which has sparked a lot of BPTT talk today

Akarsh Kumar@akarshkumar0101

We never really knew how to train nonlinear RNNs well… BPTT struggled with vanishing grads (no long-range memory) and sequential rollout (hard to parallelize).

What if instead an oracle told us the optimal memory state m_t at each step? Then the RNN could do one-step supervised learning on (m_t, x_{t+1}) → m_{t+1} labels.

We call this Supervised Memory Training (SMT): a replacement for BPTT that trains RNNs without unrolling them. SMT is time-parallelizable and solves vanishing gradients.

Website: https://akarshkumar.com/smt/ arXiv: https://arxiv.org/abs/2606.06479
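To make the one-step idea in the post concrete, here is a minimal sketch of training a recurrent cell on (m_t, x_{t+1}) → m_{t+1} pairs without unrolling. It is not the authors' implementation: the post does not say how the oracle states are obtained, so this example assumes a stand-in "oracle" (a fixed teacher recurrence), and the sizes, the tanh cell, and the learning rate are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_m, d_x = 200, 4, 3  # sequence length, memory size, input size (illustrative)

# ASSUMPTION: a fixed teacher recurrence stands in for the oracle memory states.
W_teacher = rng.normal(scale=0.5, size=(d_m, d_m + d_x))
x = rng.normal(size=(T + 1, d_x))
m = np.zeros((T + 1, d_m))
for t in range(T):
    m[t + 1] = np.tanh(W_teacher @ np.concatenate([m[t], x[t + 1]]))

# One-step supervised pairs: inputs (m_t, x_{t+1}), labels m_{t+1}.
inputs = np.concatenate([m[:-1], x[1:]], axis=1)  # shape (T, d_m + d_x)
labels = m[1:]                                    # shape (T, d_m)

def one_step_loss(W):
    """Mean squared error of the student cell on all one-step pairs."""
    pred = np.tanh(inputs @ W.T)
    return float(np.mean((pred - labels) ** 2))

W = np.zeros((d_m, d_m + d_x))  # student cell weights
initial_loss = one_step_loss(W)
lr = 0.5
for _ in range(500):
    pred = np.tanh(inputs @ W.T)  # every time step in one batched call
    grad_pre = 2.0 * (pred - labels) * (1.0 - pred ** 2) / labels.size
    W -= lr * grad_pre.T @ inputs  # plain gradient step, no unrolling through time
final_loss = one_step_loss(W)
print(initial_loss, final_loss)
```

Each update treats the T transitions as independent regression examples, so no gradient flows across time steps; the only sequential loop here is the one that fabricates the stand-in oracle states.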

6h · Views 5.4K · Likes 17 · Bookmarks 11
will depue@willdepue

@vitaliychiley dude i took kimi 2.6s architecture and tweeted it, theres literally no oai arch information here

7h · Views 1.4K · Likes 31 · Bookmarks 2
will depue@willdepue

bruh

6h · Views 3.1K · Likes 15 · Bookmarks 0
Pranav Shyam@recurseparadox

@willdepue This is not true. Your arch and depth claims were true in pre-2023 era not now. Depth absolutely matters. Parity problem can be solved by a random RNN but not by transformer.

There’s also no BPTT hate. It’s just slow
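For readers unfamiliar with the parity problem mentioned in this reply, the sketch below shows why it maps naturally onto a recurrence: a single bit of state, updated once per input, is enough. The function name `parity_recurrent` is made up for this example, and the code does not test the reply's claims about random RNNs or transformers.

```python
def parity_recurrent(bits):
    """Return 1 if `bits` contains an odd number of ones, else 0.

    The running `state` plays the role of a one-bit recurrent memory:
    state_{t+1} = state_t XOR x_{t+1}.
    """
    state = 0
    for b in bits:
        state ^= b
    return state

print(parity_recurrent([1, 0, 1, 1]))
```

The update is inherently sequential, since each state depends on the previous one. That is the kind of computation the depth discussion in this thread is about.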

6h · Views 674 · Likes 6 · Bookmarks 1
ueaj@_ueaj

@vitaliychiley you're telling me they're doing deep learning & MoEs at OpenAI? wow I couldn't have guessed

6h · Views 162 · Likes 5
bayes@bayeslord

@willdepue it does work!

8h · Views 3.1K · Likes 3 · Bookmarks 0
sasuke⚡420@sasuke___420

@willdepue @vitaliychiley *taking notes* twenty-three.. or fewer...

6h · Views 46 · Likes 1
bilal@bilaltwovec

@vitaliychiley you can't call it deep learning if you dont have at least 152 layers

7h · Views 426 · Likes 3 · Bookmarks 0
will depue@willdepue

@bayeslord i mean more in the extreme case, if scaled

8h · Views 899 · Likes 0 · Bookmarks 0
will depue@willdepue

@recurseparadox you're saying aspect ratio matters for efficiency, right? i'm just saying, as i assume you'd agree, that depth & width are surprisingly fungible. and this gives some potential avenue to reducing difficulty of BPTT

5h · Views 516 · Likes 0 · Bookmarks 0
sasuke⚡420@sasuke___420

@vitaliychiley haha yeah i think he doesn't expect that he's leaking anything

7h · Views 126 · Likes 1
will depue@willdepue

@sasuke___420 @vitaliychiley shhhh…. it’s an irrational number… sqrt(41)

6h · Views 18 · Likes 1
Strata@ChainZenit

@willdepue this is a solid take, the scaling potential is actually wild.

8h · Views 33
Charles the Fool@charlesthefool

@_ueaj @vitaliychiley With trillions of parameters at that, according to my estimates!

6h · Views 8 · Likes 1
Alex YGift@Radipdegen

@willdepue "a hard time believing" is doing a lot of work there tbh

hows that supervision going in prod?

8h · Views 11
Zero Void@0x00_void

@vitaliychiley every oai vague-post adds a trillion parameters to the rumor

6h · Views 8