the old idea that some state should change quickly while deeper priors change slowly feels obviously right, just not solved at scale yet
my bias is toward [hierarchical] recurrence/state/latent memory here
RAG-style memory is useful, but it’s still mostly “retrieve text and paste it into context.” the more interesting version is a system with persistent internal state: something that updates, decays, compresses, and changes future computation
fast weights / slow weights feel like one old path toward that
fast state = recent trajectory
medium state = reusable episodes / temporary adapters
slow state = consolidated priors
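a toy sketch of the three-timescale idea above, nothing more. it is not fast weights proper, just three exponential moving averages over the same input stream with different update rates. the class name `MultiRateState` and the rates (0.5 / 0.05 / 0.005) are made up for illustration.

```python
class MultiRateState:
    """Three EMA traces of one input stream, each with its own update rate."""

    def __init__(self, dim, rates=None):
        # Hypothetical rates: larger means quicker to change and quicker to forget.
        self.rates = rates or {"fast": 0.5, "medium": 0.05, "slow": 0.005}
        self.state = {name: [0.0] * dim for name in self.rates}

    def update(self, x):
        # Every trace sees the same input; only the mixing rate differs.
        for name, rate in self.rates.items():
            old = self.state[name]
            self.state[name] = [(1.0 - rate) * s + rate * xi for s, xi in zip(old, x)]
        return self.state


mem = MultiRateState(dim=1)
for _ in range(10):
    mem.update([1.0])
# After 10 identical inputs the fast trace has nearly converged to the input,
# while the slow trace has barely moved away from zero.
print(mem.state)
```

a real system would presumably learn what to write at each level instead of using fixed rates, but the separation of timescales is the point here.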
the latent part matters because real memory probably shouldn’t always be stored as text. sometimes it should be a compressed residue of repeated computation: what keeps mattering, what keeps recurring, what should become easier next time
the history of stateful archs is interesting...
RNNs -> state updates every step, but coherence doesn't travel far. Recurrence can learn to ferry features forward, but long-range credit assignment gets ugly (vanishing/exploding gradients through time).
HRNNs / Clockwork RNNs -> hierarchy with slower and faster timescales, but the past still ends up entangled and smeared inside the hidden state.
LSTMs & OpenAI Five -> gated recurrence made state more usable. You get persistent hidden state, selective write/forget dynamics, and enough temporal continuity for tactics, opponent modeling, cooldowns, positioning, and “what was I doing 20 seconds ago?” OpenAI Five is especially interesting here because the policy wasn’t just re-reading a transcript of the match; it had recurrent state folded into the policy loop. Not perfect memory, but actual operational state.
Transformers -> insane associative recall inside context, but weirdly stateless between calls unless you bolt memory/state on from the outside. Context is not the same thing as a continuously updated latent state. A transcript can describe your past, but it is not the same as carrying forward a compact internal dynamics vector.
Hierarchical VLM pairs (*which are particularly interesting...*) -> S2 is the slower top-level VLM; S1 is the fast visuomotor policy translating S2’s latent semantic representations into continuous actions at ~200 Hz. It's not just words/text being sent from S2 to S1. S2 learns what semantic/policy conditioning to send, S1 learns how to interpret it, which helps reduce brittle memorization and increase generalization. There is some level of state continuity, though technically the top-level S2 may still be stateless-ish per step lol.
So yeah strong suspicions that state and hierarchy may be helpful. :x
🧐🤔
continual learning probably ends up as multi-rate memory, as in Google's Titans/Miras. the memory-as-a-layer (MAL) variant looks interesting/promising imo. or maybe some online TTT + LoRA thing.
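to make "memory that updates at test time" concrete, here is a minimal sketch: a linear associative memory that takes one gradient step per observed (key, value) pair, with optional decay as a crude forgetting term. this is only loosely in the spirit of test-time-training style memories. it is not the Titans/Miras update (no momentum, no learned gates), and the names `make_memory`, `read`, `write` and the `lr` / `decay` values are assumptions for illustration.

```python
def make_memory(d_key, d_val):
    """Zero-initialised d_val x d_key weight matrix, stored as a list of rows."""
    return [[0.0] * d_key for _ in range(d_val)]


def read(W, k):
    """Recall: matrix-vector product W @ k."""
    return [sum(w * ki for w, ki in zip(row, k)) for row in W]


def write(W, k, v, lr=0.5, decay=0.0):
    """One online step: shrink old weights by `decay`, then add a gradient
    step on 0.5 * ||W k - v||^2 (error is measured before the shrink)."""
    pred = read(W, k)
    err = [vi - pi for vi, pi in zip(v, pred)]
    return [
        [(1.0 - decay) * w + lr * e * ki for w, ki in zip(row, k)]
        for row, e in zip(W, err)
    ]


W = make_memory(d_key=2, d_val=2)
key, value = [1.0, 0.0], [2.0, -1.0]
for _ in range(5):
    W = write(W, key, value)  # no text stored, only the weights change
print(read(W, key))  # recall moves toward `value` with each write
```

here `decay` plays the role of forgetting and `lr` sets how quickly new associations overwrite old ones. running several such memories at different rates would be one way to get the multi-rate picture.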