the hippocampus is so underrated
the best read on this plasticity loss:
- at the start of training the LLM has the most degrees of freedom
- training constrains more and more of them, which makes learning harder, because you had to commit to one direction to learn whatever came before
i think you can avoid this by having an architecture with a small, fast-updating network with high weight decay, which distills whatever it learns into a slower-updating, more stable network
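
a rough sketch of that fast/slow idea, assuming a toy linear regression in plain Python. everything named here (`make_data`, `train_fast_slow`, `fast_decay`, `consolidate_rate`) is hypothetical and only for illustration. the consolidation step is a weight-space stand-in for distillation: in a linear model, moving a slice of the fast weights into the slow ones leaves the combined prediction unchanged, which is what output-matching distillation would be aiming for. not the method from any particular paper.

```python
import random


def predict(w, x):
    """Linear prediction: dot product of weights and inputs."""
    return sum(wi * xi for wi, xi in zip(w, x))


def make_data(true_w, n=50, seed=0):
    """Toy noise-free regression data (an assumption for this sketch)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in true_w]
        data.append((x, predict(true_w, x)))
    return data


def train_fast_slow(data, dim, steps=2000, fast_lr=0.1, fast_decay=0.05,
                    consolidate_rate=0.02, seed=0):
    """Fast weights learn with SGD plus weight decay; slow weights only
    change by absorbing a small fraction of the fast weights each step."""
    rng = random.Random(seed)
    fast = [0.0] * dim
    slow = [0.0] * dim
    for _ in range(steps):
        x, y = rng.choice(data)
        # The model's output is the sum of both networks.
        err = predict(slow, x) + predict(fast, x) - y
        # Fast network: gradient step on squared error, with weight decay.
        fast = [f - fast_lr * (err * xi + fast_decay * f)
                for f, xi in zip(fast, x)]
        # Consolidation: shift a fraction of fast into slow. The sum
        # slow + fast is unchanged, so predictions are preserved.
        moved = [consolidate_rate * f for f in fast]
        slow = [s + m for s, m in zip(slow, moved)]
        fast = [f - m for f, m in zip(fast, moved)]
    return fast, slow


demo_fast, demo_slow = train_fast_slow(make_data([2.0, -1.0]), dim=2)
print("fast:", demo_fast)
print("slow:", demo_slow)
```

in this toy setup the knowledge should end up in the slow weights while the fast weights drain back toward zero, so the fast network stays free to pick up whatever comes next. whether that carries over to an actual LLM is exactly the open question.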
Oh yeah, we already have this. If you would just follow the GOAT Ali Behrouz: https://arxiv.org/pdf/2606.03979

