
Will Depue, former OpenAI Sora engineer, proposes reviving fast and slow weights for modern continual learning

Research engineer @kalomaze suggested carrying GAN intuitions over to a multidimensional setting.

Original post
will depue @willdepue · #315 in Tech

a friend asked me today which old ideas in machine learning might come back in the future

my immediate thought was fast and slow weights are very elegant and would be cool to see again, perhaps in context of continual learning

curious for others' suggestions

5:31 PM · Jun 8, 2026 · 12.2K Views
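For readers who have not met the idea, here is a minimal sketch of one classic fast-weight formulation: the decayed Hebbian outer-product rule from Ba et al. (2016), A <- decay * A + lr * h h^T. It is not necessarily the variant the post has in mind. The function names, the sizes, and the constants `decay=0.9` and `lr=0.5` are illustrative assumptions, and the forward step is simplified to a single sum of the slow and fast pathways.

```python
# Sketch only: slow weights are learned by gradient descent over many examples
# (held fixed here); fast weights are rewritten on every step without gradients.

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def matvec(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def update_fast_weights(A, h, decay=0.9, lr=0.5):
    """Hebbian fast-weight update: A <- decay * A + lr * h h^T."""
    hh = outer(h, h)
    return [[decay * a + lr * o for a, o in zip(row_a, row_o)]
            for row_a, row_o in zip(A, hh)]


def step(W_slow, A_fast, x):
    """Simplified forward step: slow pathway plus fast pathway."""
    slow = matvec(W_slow, x)
    fast = matvec(A_fast, x)
    return [s + f for s, f in zip(slow, fast)]


# Store one pattern, then a second one; the first trace decays but is not erased.
A = [[0.0] * 3 for _ in range(3)]
A = update_fast_weights(A, [1.0, 0.0, 0.0])
A = update_fast_weights(A, [0.0, 1.0, 0.0])
print(A[0][0], A[1][1])  # older trace is smaller than the newer one
```

The decay term is what makes the memory short-lived: anything that is not refreshed fades geometrically, while the slow weights are left untouched.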
Sentiment

Users are enthusiastic about reviving fast weights for continual learning because older modular ideas like mixtures of experts seem poised for a practical comeback.

Pos: 100.0% · Neg: 0.0%
6 comments with sentiment.
Cluster Engagement
Posts from X
Rohan Pandey @khoomeik

but will, we already have fast weights at home

1d · Views 2K · Likes 24 · Bookmarks 14
RETWEETS 1

@willdepue We got super interesting results/properties with fast-slow learning formulation. Paper link: https://arxiv.org/abs/2605.12484

1d · Views 203 · Likes 3 · Bookmarks 1
REPLIES 3
mike64_t @mike64_t

@willdepue The fact that ultimately there is no difference between activations and weights in the limit has interesting consequences

2d · Views 1.1K · Likes 23 · Bookmarks 4
Daryl @allvibesnoskill

@willdepue We will be back GAN bros

2d · Views 426 · Likes 13
will depue @willdepue

@allvibesnoskill i'm incredibly bullish on adversarial methods returning soon!

1d · Views 355 · Likes 9 · Bookmarks 1
kalomaze @kalomaze

@willdepue gan intuitions but make it multidimensional

1d · Views 770 · Likes 13 · Bookmarks 0
Aryaman Arora @aryaman2020

@mike64_t @willdepue wait what is this referring to

1d · Views 217 · Likes 2 · Bookmarks 1
bilal @bilaltwovec

@mike64_t @willdepue thought it's funny how you brought this up now given the discourse du jour lol https://arxiv.org/abs/2106.06199

2d · Views 189 · Likes 2 · Bookmarks 1
karthik @akbirthko

@mike64_t @willdepue we bringing back 2024 discourse w/ this one. how funny if he gets the last laugh?

1d · Views 233 · Likes 5
karthik @akbirthko

@mike64_t @willdepue fanfic during ssm craze, i wonder if thinky is going for this @alth0u

1d · Views 52 · Likes 1
Sir Mr Meow Meow @SirMrMeowmeow

the old idea that some state should change quickly while deeper priors change slowly feels obviously right, just not solved at scale yet

my bias is toward [hierarchical] recurrence/state/latent memory here

RAG-style memory is useful, but it’s still mostly “retrieve text and paste it into context.” the more interesting version is a system with persistent internal state: something that updates, decays, compresses, and changes future computation

fast weights / slow weights feel like one old path toward that

fast state = recent trajectory
medium state = reusable episodes / temporary adapters
slow state = consolidated priors

the latent part because real memory probably shouldn’t always be stored as text. sometimes it should be a compressed residue of repeated computation: what keeps mattering, what keeps recurring, what should become easier next time

the history of stateful archs is interesting...

RNNs -> frequent updates, coherence doesn't travel far. Recurrence can learn to ferry features, but long-range credit assignment gets ugly.

HRNNs / Clockwork RNNs -> hierarchy, slower/faster timescales, but still entangled & smeared past.

LSTMs & OpenAI Five -> gated recurrence made state more usable. You get persistent hidden state, selective write/forget dynamics, and enough temporal continuity for tactics, opponent modeling, cooldowns, positioning, and “what was I doing 20 seconds ago?” OpenAI-Five is especially interesting here because the policy wasn’t just re-reading a transcript of the match; it had recurrent state folded into the policy loop. Not perfect memory, but actual operational state.

Transformers -> insane associative recall inside context, but weirdly stateless between calls unless you bolt memory/state on from the outside. Context is not the same thing as a continuously updated latent state. A transcript can describe your past, but it is not the same as carrying forward a compact internal dynamics vector.

Hierarchical VLM pairs (*which are particularly interesting...*) -> S1 is the fast visuomotor policy translating S2’s latent semantic representations into continuous actions at ~200 Hz. It's not just words/text being sent from S2 to S1. S2 learns to best send semantic/policy-conditioning, S1 learns best to interpret it, which helps reduce brittle memorization and increase generalization. There is some level of state continuity, though technically the top-level S2 may still be stateless-ish per step lol.

So yeah strong suspicions that state and hierarchy may be helpful. :x 🧐🤔

continual learning probably ends up as multi-rate memory, as in google's Titans/MIRAS, or the memory layers as in their MAL, which look interesting/promising imo. or maybe some TTT-online LoRA thing.

1d · Views 128 · Likes 4
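The fast / medium / slow split described in the post above can be illustrated with a toy multi-rate memory, where each tier is an exponential moving average of the same observations with its own update rate. This is only a sketch of the timescale idea, not of Titans, MIRAS, or any adapter scheme. The tier names follow the post; the rates in `RATES` and the function names are assumptions made up for the example.

```python
# Toy multi-rate memory: three states track the same signal at different speeds.
# The rates are arbitrary illustrative values, not tuned or taken from a paper.
RATES = {"fast": 0.5, "medium": 0.1, "slow": 0.01}


def ema_update(state, obs, rate):
    """Move `state` a fraction `rate` of the way toward `obs`."""
    return [(1.0 - rate) * s + rate * o for s, o in zip(state, obs)]


def step_memory(memory, obs):
    """Apply one observation to every tier, each at its own rate."""
    return {name: ema_update(memory[name], obs, RATES[name]) for name in RATES}


memory = {name: [0.0] for name in RATES}
for _ in range(10):
    memory = step_memory(memory, [1.0])

# After ten identical observations the fast tier has nearly converged,
# the medium tier is partway there, and the slow tier has barely moved.
for name in RATES:
    print(name, round(memory[name][0], 4))
```

A real system would presumably use different mechanisms per tier (recurrent state, temporary adapters, consolidated weights); the shared EMA here only makes the difference in timescales visible.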
patrick @ConsumerRick

@willdepue JEPA

2d · Views 259 · Likes 2
Aryaman Arora @aryaman2020

@mike64_t @willdepue ohh I see. because activations are data-dependent but weights are not, so fast weights \approx activations?

1d · Views 103 · Likes 2
mike64_t @mike64_t

@akbirthko @willdepue oh he will for sure

1d · Views 100 · Likes 2
Sir Mr Meow Meow @SirMrMeowmeow

likely latent state and hierarchy around the ar core*. i wouldn't try to ditch that at this point lol.

some other interesting bits which make me feel hmm: like the Larimar paper (Larimar: BERT-style encoder writes facts into external memory, whose readout conditions GPT-2 or a GPT-style decoder without weight edits.)

and various steering/interp papers using or manipulating reps at various levels to augment the ar core and manipulate the residual stream

Larimar is another reminder that memory can be a learned memory interface around the AR core: write/update/forget mechanisms whose readouts condition generation.

=== >Larimar uses a BERT-style encoder during training and memory writing, but the decoder/base LM is not updated during fact editing. The “memory” gets written/updated, then its readout conditions the decoder. The Larimar paper had three modules: encoder, associative memory, decoder, trained together; then new facts can be added in one shot without retraining/fine-tuning the LLM.

1d · Views 29 · Likes 3
Maxence Frenette @maxencefrenette

@willdepue RNNs. They kinda are already, with RWKV, Mamba, GDN, etc. (I don’t think SSMs are technically classified as RNNs, but they share the constant state size property.)

2d · Views 292 · Likes 1
mike64_t @mike64_t

@aryaman2020 @willdepue The tweet above of course

1d · Views 123 · Likes 1
lumi @agitbackprop

@willdepue the fast weights are stored in the KV cache

1d · Views 117 · Likes 5
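The "fast weights are stored in the KV cache" remark can be made concrete for one special case: linear attention with no softmax and no normalisation. There, reading the whole cache with a query gives exactly the same result as multiplying the query by a single matrix built from outer products of values and keys, which is a fast-weight matrix written at inference time. Standard softmax attention is only analogous, not identical. The function names and the toy keys, values, and query below are assumptions for illustration.

```python
# Sketch: unnormalised linear attention over a KV cache equals a fast-weight read.
# This identity does not hold exactly once a softmax is applied to the scores.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend_over_cache(keys, values, q):
    """sum_i (k_i . q) * v_i, computed by scanning every cached pair."""
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        w = dot(k, q)
        out = [o + w * vi for o, vi in zip(out, v)]
    return out


def write_fast_weights(keys, values):
    """Fold the cache into one matrix W = sum_i v_i k_i^T."""
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]
    for k, v in zip(keys, values):
        for r in range(d_v):
            for c in range(d_k):
                W[r][c] += v[r] * k[c]
    return W


def read_fast_weights(W, q):
    """Apply the fast-weight matrix to the query: W q."""
    return [dot(row, q) for row in W]


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[2.0, 0.0, 1.0], [0.0, 3.0, 1.0], [1.0, 1.0, 0.0]]
q = [0.5, 2.0]

from_cache = attend_over_cache(keys, values, q)
from_weights = read_fast_weights(write_fast_weights(keys, values), q)
print(from_cache, from_weights)  # both print [3.5, 8.5, 2.5]
```

The cache grows with every token while the matrix stays a fixed size, which is the same trade-off raised elsewhere in the thread about constant-state recurrent models.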
mike64_t @mike64_t

@aryaman2020 @willdepue yep fast weights approx. weights. Which is related to scaling BPTT. It’s also related to emergent meta learning. Ultimate Schmidhuber victory. He truly invented everything 🫡

1d · Views 117 · Likes 5
mike64_t @mike64_t

@bilaltwovec @willdepue “Do not fall for local gradient methods” 😭

2d · Views 113 · Likes 1