AI · 7h ago

Will Depue, former OpenAI Sora engineer, proposes reviving fast and slow weights for modern continual learning

Research engineer @kalomaze suggested carrying GAN intuitions over to a multidimensional setting.

Original post
will depue @willdepue · #249 in AI

a friend asked me today which old ideas in machine learning might come back in the future

my immediate thought was fast and slow weights are very elegant and would be cool to see again, perhaps in context of continual learning

curious for others' suggestions

5:31 PM · Jun 8, 2026 · 9.7K Views
Sentiment

Users are excited about reviving fast and slow weights for continual learning, and many also expect a comeback for older techniques such as GANs, mixtures of experts, and adversarial methods.

Pos 100.0% · Neg 0.0% · 10 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 1.1K · Bookmarks 4 · Likes 23 · Replies 3
mike64_t @mike64_t

@willdepue The fact that ultimately there is no difference between activations and weights in the limit has interesting consequences

6h · Views 1.1K · Likes 23 · Bookmarks 4
Retweets 1

@willdepue We got super interesting results/properties with fast-slow learning formulation. Paper link: https://arxiv.org/abs/2605.12484

3h · Views 203 · Likes 3 · Bookmarks 1
Daryl @allvibesnoskill

@willdepue We will be back GAN bros

6h · Views 426 · Likes 13
will depue @willdepue

@allvibesnoskill i'm incredibly bullish on adversarial methods returning soon!

5h · Views 355 · Likes 9 · Bookmarks 1
Aryaman Arora @aryaman2020

@mike64_t @willdepue wait what is this referring to

6h · Views 217 · Likes 2 · Bookmarks 1
bilal @bilaltwovec

@mike64_t @willdepue thought its funny how you brought this up now given the discourse du jour lol https://arxiv.org/abs/2106.06199

6h · Views 189 · Likes 2 · Bookmarks 1
kalomaze @kalomaze

@willdepue gan intuitions but make it multidimensional

4h · Views 627 · Likes 11 · Bookmarks 0
karthik @akbirthko

@mike64_t @willdepue we bringing back 2024 discourse w/ this one. how funny if he get the last laugh?

5h · Views 233 · Likes 5
karthik @akbirthko

@mike64_t @willdepue fanfic during ssm craze, i wonder if thinky is going for this @alth0u

5h · Views 52 · Likes 1
Sir Mr Meow Meow @SirMrMeowmeow

the old idea that some state should change quickly while deeper priors change slowly feels obviously right, just not solved at scale yet

my bias is toward [hierarchical] recurrence/state/latent memory here

RAG-style memory is useful, but it’s still mostly “retrieve text and paste it into context.” the more interesting version is a system with persistent internal state: something that updates, decays, compresses, and changes future computation

fast weights / slow weights feel like one old path toward that

fast state = recent trajectory
medium state = reusable episodes / temporary adapters
slow state = consolidated priors

the latent part because real memory probably shouldn’t always be stored as text. sometimes it should be a compressed residue of repeated computation: what keeps mattering, what keeps recurring, what should become easier next time

the history of stateful archs is interesting...

RNNs -> frequent updates, coherence doesn't travel far. Recurrence can learn to ferry features, but long-range credit assignment gets ugly.

HRNNs / Clockwork RNNs -> hierarchy, slower/faster timescales, but still entangled & smeared past.

LSTMs & OpenAI Five -> gated recurrence made state more usable. You get persistent hidden state, selective write/forget dynamics, and enough temporal continuity for tactics, opponent modeling, cooldowns, positioning, and “what was I doing 20 seconds ago?” OpenAI-Five is especially interesting here because the policy wasn’t just re-reading a transcript of the match; it had recurrent state folded into the policy loop. Not perfect memory, but actual operational state.

Transformers -> insane associative recall inside context, but weirdly stateless between calls unless you bolt memory/state on from the outside. Context is not the same thing as a continuously updated latent state. A transcript can describe your past, but it is not the same as carrying forward a compact internal dynamics vector.

Hierarchical VLM pairs (*which are particularly interesting...*) -> S1 is the fast visuomotor policy translating S2’s latent semantic representations into continuous actions at ~200 Hz. It's not just words/text being sent from S2 to S1. S2 learns to best send semantic/policy-conditioning, S1 learns best to interpret it, which helps reduce brittle memorization and increase generalization. There is some level of state continuity, though technically the top-level S2 may still be stateless-ish per step lol.

So yeah strong suspicions that state and hierarchy may be helpful. :x 🧐🤔

continual learning probably ends up as multi-rate memory, as in google's titans/Miras; the memory layers as in their MAL look interesting/promising imo. or maybe some TTT-online lora thing.

5h · Views 128 · Likes 4
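
The fast / medium / slow split described in the post above can be sketched, very roughly, as several copies of state that track the same input stream at different update rates. The example below uses plain exponential moving averages as a stand-in; it is not the Titans/Miras mechanism or any method named in the thread, and the class name `MultiRateState` and the three rates are hypothetical choices made for illustration.

```python
class MultiRateState:
    """Three exponential moving averages of one input stream.

    Illustrative only: the rates are assumed values, not taken from any paper.
    """

    def __init__(self, dim, rates=(0.5, 0.1, 0.01)):
        self.rates = rates  # hypothetical fast / medium / slow step sizes
        self.states = [[0.0] * dim for _ in rates]

    def update(self, x):
        # Each state moves toward the new input by its own step size.
        for state, rate in zip(self.states, self.rates):
            for i, xi in enumerate(x):
                state[i] += rate * (xi - state[i])
        return self.states


mem = MultiRateState(dim=1)
for _ in range(20):
    mem.update([1.0])   # a long stretch of one signal
for _ in range(3):
    mem.update([-1.0])  # a brief change in the signal

fast, medium, slow = (s[0] for s in mem.states)
print(fast, medium, slow)
```

After the brief change, the fast state has already flipped sign while the medium and slow states still reflect the longer history, which is the "recent trajectory" versus "consolidated priors" behaviour the post is pointing at.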
patrick @ConsumerRick

@willdepue JEPA

6h · Views 259 · Likes 2
Aryaman Arora @aryaman2020

@mike64_t @willdepue ohh I see. because activations are data-dependent but weights are not, so fast weights \approx activations?

5h · Views 103 · Likes 2
mike64_t @mike64_t

@akbirthko @willdepue oh he will for sure

5h · Views 100 · Likes 2
Sir Mr Meow Meow @SirMrMeowmeow

likely latent state and hierarchy around the ar core*. i wouldn't try to ditch that at this point lol.

some other interesting bits which make me feel hmm: like the Larimar paper (Larimar: BERT-style encoder writes facts into external memory, whose readout conditions GPT-2 or a GPT-style decoder without weight edits.)

and various steering/interp papers using or manipulating reps at various levels to augment the ar core and manipulate the residual stream

Larimar is another reminder that memory can be a learned memory interface around the AR core: write/update/forget mechanisms whose readouts condition generation.

Larimar uses a BERT-style encoder during training and memory writing, but the decoder/base LM is not updated during fact editing. The “memory” gets written/updated, then its readout conditions the decoder. The Larimar paper had three modules: encoder, associative memory, decoder, trained together; then new facts can be added in one shot without retraining/fine-tuning the LLM.

5h · Views 29 · Likes 3
Maxence Frenette @maxencefrenette

@willdepue RNNs. They kinda are already, with RWKV, Mamba, GDN, etc. (I don’t think SSMs are technically classified as RNNs, but they share the constant state size property.)

6h · Views 292 · Likes 1
mike64_t @mike64_t

@aryaman2020 @willdepue The tweet above of course

6h · Views 123 · Likes 1
lumi @agitbackprop

@willdepue the fast weights are stored in the KV cache

5h · Views 117 · Likes 5
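
One concrete way to read the claim in the post above, as a minimal sketch: for unnormalized linear attention (not the softmax attention used in most transformers), attending over cached keys and values gives the same result as reading from a single fast-weight matrix that receives one outer-product write per token. The helper names and toy numbers below are illustrative assumptions, not taken from the thread.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend_over_cache(keys, values, query):
    """Unnormalized linear attention: sum_i (k_i . q) * v_i."""
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = dot(k, query)
        out = [o + score * x for o, x in zip(out, v)]
    return out


def write_fast_weights(W, key, value):
    """Fast-weight write: W <- W + outer(value, key)."""
    return [[w + v * k for w, k in zip(row, key)] for row, v in zip(W, value)]


def read_fast_weights(W, query):
    """Fast-weight read: W @ query."""
    return [dot(row, query) for row in W]


# Toy "KV cache": three tokens, key dim 3, value dim 2 (made-up numbers).
keys = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [0.0, 3.0, 1.0]]
values = [[1.0, 2.0], [0.0, -1.0], [4.0, 0.5]]
query = [1.0, 1.0, 1.0]

# Build the fast-weight matrix with one write per cached token.
W = [[0.0] * 3 for _ in range(2)]
for k, v in zip(keys, values):
    W = write_fast_weights(W, k, v)

from_cache = attend_over_cache(keys, values, query)
from_weights = read_fast_weights(W, query)
print(from_cache, from_weights)
```

The cache grows by one key/value pair per token, while `W` keeps a fixed 2x3 shape. With softmax attention the equivalence does not hold exactly, because the normalization couples all cached entries.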
mike64_t @mike64_t

@aryaman2020 @willdepue yep fast weights approx. weights. Which is related to scaling BPTT. It’s also related to emergent meta learning. Ultimate Schmidhuber victory. He truly invented everything 🫡

5h · Views 117 · Likes 5
mike64_t @mike64_t

@bilaltwovec @willdepue “Do not fall for local gradient methods” 😭

6h · Views 113 · Likes 1
bilal @bilaltwovec

@mike64_t @willdepue i like this!

6h · Views 58 · Likes 1