Tech · 2d ago

Will Depue, former OpenAI Sora engineer, proposes reviving fast and slow weights for modern continual learning

Research engineer @kalomaze suggested adapting multidimensional GAN frameworks.


Original post

will depue@willdepue

a friend asked me today which old ideas in machine learning might come back in the future

my immediate thought was fast and slow weights are very elegant and would be cool to see again, perhaps in context of continual learning

curious for others' suggestions

5:31 PM · Jun 8, 2026 · 12.2K Views

Sentiment

Users are enthusiastic about reviving fast weights for continual learning because older modular ideas like mixtures of experts seem poised for a practical comeback.

Positive: 100.0% · Negative: 0.0%

6 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 2K · Bookmarks: 14 · Likes: 24

Rohan Pandey@khoomeik

but will, we already have fast weights at home


RETWEETS1

Rishabh Tiwari@rish2k1

@willdepue We got super interesting results/properties with fast-slow learning formulation. Paper link: https://arxiv.org/abs/2605.12484


REPLIES3

mike64_t@mike64_t

@willdepue The fact that ultimately there is no difference between activations and weights in the limit has interesting consequences


Daryl@allvibesnoskill

@willdepue We will be back GAN bros


will depue@willdepue

@allvibesnoskill i'm incredibly bullish on adversarial methods returning soon!


kalomaze@kalomaze

@willdepue gan intuitions but make it multidimensional


Aryaman Arora@aryaman2020

@mike64_t @willdepue wait what is this referring to


bilal@bilaltwovec

@mike64_t @willdepue thought its funny how you brought this up now given the discourse du jour lol https://arxiv.org/abs/2106.06199


karthik@akbirthko

@mike64_t @willdepue we bringing back 2024 discourse w/ this one. how funny if he get the last laugh?


karthik@akbirthko

@mike64_t @willdepue fanfic during ssm craze, i wonder if thinky is going for this @alth0u


Sir Mr Meow Meow@SirMrMeowmeow

the old idea that some state should change quickly while deeper priors change slowly feels obviously right, just not solved at scale yet

my bias is toward [hierarchical] recurrence/state/latent memory here

RAG-style memory is useful, but it’s still mostly “retrieve text and paste it into context.” the more interesting version is a system with persistent internal state: something that updates, decays, compresses, and changes future computation

fast weights / slow weights feel like one old path toward that

fast state = recent trajectory
medium state = reusable episodes / temporary adapters
slow state = consolidated priors

the latent part is because real memory probably shouldn’t always be stored as text. sometimes it should be a compressed residue of repeated computation: what keeps mattering, what keeps recurring, what should become easier next time

the history of stateful archs is interesting...

RNNs -> frequent updates, coherence doesn't travel far. Recurrence can learn to ferry features, but long-range credit assignment gets ugly.

HRNNs / Clockwork RNNs -> hierarchy, slower/faster timescales, but still entangled & smeared past.

LSTMs & OpenAI Five -> gated recurrence made state more usable. You get persistent hidden state, selective write/forget dynamics, and enough temporal continuity for tactics, opponent modeling, cooldowns, positioning, and “what was I doing 20 seconds ago?” OpenAI-Five is especially interesting here because the policy wasn’t just re-reading a transcript of the match; it had recurrent state folded into the policy loop. Not perfect memory, but actual operational state.

Transformers -> insane associative recall inside context, but weirdly stateless between calls unless you bolt memory/state on from the outside. Context is not the same thing as a continuously updated latent state. A transcript can describe your past, but it is not the same as carrying forward a compact internal dynamics vector.

Hierarchical VLM pairs (*which are particularly interesting...*) -> S1 is the fast visuomotor policy translating S2’s latent semantic representations into continuous actions at ~200 Hz. It's not just words/text being sent from S2 to S1. S2 learns to best send semantic/policy-conditioning, S1 learns best to interpret it, which helps reduce brittle memorization and increase generalization. There is some level of state continuity, though technically the top-level S2 may still be stateless-ish per step lol.

So yeah strong suspicions that state and hierarchy may be helpful. :x 🧐🤔

continual learning probably ends up as multi-rate memory, as in google's titans/Miras; the memory layers in their MAL also look interesting/promising imo. or maybe some TTT-online lora thing.
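The fast / medium / slow split described in this post can be illustrated with a toy sketch. This is not how Titans, MIRAS, or any system named above works; it only shows one very simple reading of "multi-rate memory": several exponential moving averages of the same input stream, each with its own update rate. The class name `MultiRateState` and the three rates are hypothetical choices made for illustration.

```python
import numpy as np

class MultiRateState:
    """Toy multi-rate memory: one exponential moving average per timescale.

    Hypothetical illustration only; the rates are arbitrary assumptions.
    """

    def __init__(self, dim, rates=(0.5, 0.05, 0.005)):
        # One update rate per tier: fast, medium, slow.
        self.rates = np.asarray(rates, dtype=float)
        self.state = np.zeros((len(rates), dim))

    def update(self, x):
        x = np.asarray(x, dtype=float)
        # state_i <- (1 - r_i) * state_i + r_i * x, applied to every tier at once.
        self.state += self.rates[:, None] * (x[None, :] - self.state)
        return self.state

mem = MultiRateState(dim=2)
for _ in range(20):
    mem.update([1.0, 1.0])

# After 20 identical inputs the fast tier has nearly converged,
# while the slow tier has barely moved.
fast, medium, slow = mem.state[:, 0]
print(fast, medium, slow)
```

The fast tier tracks the recent trajectory almost immediately, while the slow tier only shifts if the same signal keeps recurring, which is the consolidation behaviour the post gestures at.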


patrick@ConsumerRick

@willdepue JEPA


Aryaman Arora@aryaman2020

@mike64_t @willdepue ohh I see. because activations are data-dependent but weights are not, so fast weights \approx activations?


mike64_t@mike64_t

@akbirthko @willdepue oh he will for sure


Sir Mr Meow Meow@SirMrMeowmeow

likely latent state and hierarchy around the ar core*. i wouldn't try to ditch that at this point lol.

some other interesting bits which make me feel hmm: like the Larimar paper (Larimar: BERT-style encoder writes facts into external memory, whose readout conditions GPT-2 or a GPT-style decoder without weight edits.)

and various steering/interp papers using or manipulating reps at various levels to augment the ar core and manipulate the residual stream

Larimar is another reminder that memory can be a learned memory interface around the AR core: write/update/forget mechanisms whose readouts condition generation.

Larimar uses a BERT-style encoder during training and memory writing, but the decoder/base LM is not updated during fact editing. The “memory” gets written/updated, then its readout conditions the decoder. The Larimar paper had three modules: encoder, associative memory, decoder, trained together; then new facts can be added in one shot without retraining/fine-tuning the LLM.


Maxence Frenette@maxencefrenette

@willdepue RNNs. They kinda are already, with RWKV, Mamba, GDN, etc. (I don’t think SSMs are technically classified as RNNs, but they share the constant state size property.)


mike64_t@mike64_t

@aryaman2020 @willdepue The tweet above of course


lumi@agitbackprop

@willdepue the fast weights are stored in the KV cache
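One way to make the KV-cache remark concrete is a small numerical check. It holds exactly only for unnormalized linear attention (no softmax), which is an assumption of this sketch; for standard softmax attention the correspondence is an analogy rather than an identity. In the linear case, attending over cached keys and values gives the same output as applying a "fast weight" matrix built from the sum of outer products of values and keys. All shapes and variable names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, T = 4, 3, 6

K = rng.normal(size=(T, d_k))  # cached keys, one row per past token
V = rng.normal(size=(T, d_v))  # cached values, one row per past token
q = rng.normal(size=d_k)       # current query

# View 1: attend over the cache (linear attention, no softmax, no normalization).
scores = K @ q                 # shape (T,): one dot product per cached key
out_cache = scores @ V         # shape (d_v,): score-weighted sum of values

# View 2: fold the cache into a fast weight matrix, one outer product per token.
W_fast = np.zeros((d_v, d_k))
for k, v in zip(K, V):
    W_fast += np.outer(v, k)   # "write" step: rank-1 update
out_fast = W_fast @ q          # "read" step: apply the fast weights to the query

print(np.allclose(out_cache, out_fast))
```

The second view keeps a fixed-size matrix instead of a cache that grows with sequence length, which is the constant-state property other replies in the thread associate with RNN-style models.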


mike64_t@mike64_t

@aryaman2020 @willdepue yep fast weights approx. weights. Which is related to scaling BPTT. It’s also related to emergent meta learning. Ultimate Schmidhuber victory. He truly invented everything 🫡


mike64_t@mike64_t

@bilaltwovec @willdepue “Do not fall for local gradient methods” 😭
