Zyphra derives an empirical scaling law predicting the onset of plasticity loss in LLMs during continual learning

the best read on this plasticity loss:
- at the start of training the LLM has the most degrees of freedom
- training constrains more and more of them, which makes learning harder, because you had to commit to one direction to learn whatever came before

i think you can avoid this by having an architecture that has a small fast updating network with high weight decay, which distills whatever it learns into a slower updating more stable network

Oh yeah, we already have this. If you would just follow the GOAT Ali Behrouz: https://arxiv.org/pdf/2606.03979

Zyphra@ZyphraAI

Zyphra is sharing our first work in continual learning where we study: Can LLMs learn forever from new data?

Many see continual learning as a path to AGI through recursive self-improvement (RSI).

The first obstacle is plasticity loss. We derive a scaling law for its onset 🧵

Zyphra@ZyphraAI

Full details in the paper and blog:

Paper: http://arxiv.org/abs/2606.24752 Blog: http://zyphra.com/our-work/plasticity-loss-in-continual-learning

Zyphra@ZyphraAI

To understand the mechanism, we looked inside the network. Dormant MLP units, lazy & collapsed attention heads, and parameter magnitudes all grow as plasticity fades, but nothing fully explains it.

We don't have a clean cause or fix yet. That's the problem we're taking on next.
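
One of the diagnostics named above, the share of dormant MLP units, can be sketched as follows. This is a minimal illustration and not the paper's measurement code: it assumes one common definition from the dormant-neuron literature (a unit is dormant when its mean absolute activation, normalized by the layer average, falls at or below a threshold). The function name `dormant_fraction` and the threshold `tau = 0.1` are hypothetical choices made for this example.

```python
from typing import Sequence


def dormant_fraction(activations: Sequence[Sequence[float]], tau: float = 0.1) -> float:
    """Fraction of units in one layer that count as dormant.

    activations[i][j] is the activation of unit j on input i.
    Assumed definition (not confirmed by the thread): unit j is dormant when
    mean|a_j| divided by the layer-wide average of mean|a| is <= tau.
    """
    if not activations or not activations[0]:
        raise ValueError("need at least one input and one unit")
    num_inputs = len(activations)
    num_units = len(activations[0])

    # Mean absolute activation of each unit over the batch of inputs.
    mean_abs = [
        sum(abs(row[j]) for row in activations) / num_inputs
        for j in range(num_units)
    ]
    layer_mean = sum(mean_abs) / num_units
    if layer_mean == 0.0:
        return 1.0  # every unit is silent on every input

    # Normalized score per unit; low scores mean the unit barely fires.
    scores = [m / layer_mean for m in mean_abs]
    return sum(score <= tau for score in scores) / num_units


if __name__ == "__main__":
    # Toy activations: units 1 and 3 are (almost) always zero.
    toy = [
        [1.0, 0.0, 0.5, 0.00],
        [2.0, 0.0, 0.5, 0.01],
    ]
    print(dormant_fraction(toy))
```

Tracking a number like this across training checkpoints is one way to see the trend the post describes, though the thread is explicit that no single statistic fully explains the loss of plasticity.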

Zyphra@ZyphraAI

Every model we tested, from 5M to 314M non-embedding params, eventually loses plasticity. Learning a new language (Vietnamese) improves at first, then steadily and inexorably degrades.

Bigger models transfer better and hold out longer, but eventually they all lose plasticity.

Zyphra@ZyphraAI

We fit a scaling law for when plasticity loss sets in: T ∝ P^0.83

Onset grows sublinearly as a power-law with parameter count. So scaling delays the problem with sharply diminishing returns, but scale alone cannot save us from plasticity loss.
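
To make the sublinear exponent concrete, here is a small arithmetic sketch. Only the exponent 0.83 and the 5M and 314M model sizes come from the thread; the prefactor of the law is not given, so the code compares ratios of onset times rather than absolute values. The function name `onset_ratio` is made up for this example.

```python
EXPONENT = 0.83  # fitted exponent quoted in the thread: T ∝ P^0.83


def onset_ratio(params_small: float, params_large: float,
                exponent: float = EXPONENT) -> float:
    """Factor by which onset is delayed when scaling from params_small to params_large.

    The unknown prefactor of the power law cancels in the ratio.
    """
    if params_small <= 0 or params_large <= 0:
        raise ValueError("parameter counts must be positive")
    return (params_large / params_small) ** exponent


if __name__ == "__main__":
    # Doubling the parameter count delays onset by less than 2x.
    print(f"2x params  -> {onset_ratio(1, 2):.2f}x later onset")
    # Smallest and largest model sizes mentioned in the thread.
    print(f"5M -> 314M -> {onset_ratio(5e6, 314e6):.1f}x later onset")
```

Under this exponent, doubling the parameter count delays onset by roughly 1.78x, and going from 5M to 314M parameters (about 63x more) delays it by roughly 31x. Each added parameter buys a little less delay than the last.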

Zyphra@ZyphraAI

To study plasticity loss in LLMs, we built a Multilingual Continual Learning Problem. A GPT-style model cycles endlessly through 8 languages, 5B tokens each.

After each cycle we fine-tune on held-out Vietnamese data. The slower it adapts, the more plasticity is lost.
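
The protocol described above can be sketched as an event schedule. This is an illustration only: the thread gives the number of languages (8), the token budget per language (5B), and the Vietnamese probe, but it does not list the training languages, so the names below are placeholders. The `schedule` function and the event tuple layout are assumptions made for this example.

```python
from typing import Iterator, List, Tuple

NUM_LANGUAGES = 8                    # from the thread
TOKENS_PER_LANGUAGE = 5_000_000_000  # from the thread: 5B tokens each
PROBE_LANGUAGE = "Vietnamese"        # held-out probe task, from the thread

# Placeholder names: the thread does not say which 8 languages are used.
LANGUAGES: List[str] = [f"lang_{i}" for i in range(NUM_LANGUAGES)]

# (kind, cycle index, language, cumulative training tokens so far)
Event = Tuple[str, int, str, int]


def schedule(num_cycles: int) -> Iterator[Event]:
    """Yield training phases and plasticity probes in order."""
    tokens_seen = 0
    for cycle in range(num_cycles):
        # One cycle: train on each language in turn.
        for lang in LANGUAGES:
            tokens_seen += TOKENS_PER_LANGUAGE
            yield ("train", cycle, lang, tokens_seen)
        # After each full cycle, measure how quickly the model adapts
        # to held-out data in the probe language.
        yield ("probe", cycle, PROBE_LANGUAGE, tokens_seen)


if __name__ == "__main__":
    for kind, cycle, lang, tokens in schedule(num_cycles=2):
        print(f"{kind:5s} cycle={cycle} {lang:10s} tokens_seen={tokens:,}")
```

Comparing the probe's adaptation speed across cycles is what turns this schedule into a plasticity measurement: slower adaptation at a later probe means more plasticity has been lost.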

Zyphra@ZyphraAI

Plasticity loss is when a network gradually loses its ability to learn during training.

It is not catastrophic forgetting: forgetting loses old knowledge; plasticity loss is losing the capacity to learn.

We show it is general and fundamental across modern LLM architectures.

deep Manifold@BetaTomorrow

@ZyphraAI *** Why Is Continual Learning Even Possible Mathematically? ***

Zyphra@ZyphraAI

@ZyphraAI is an open superintelligence research and product company based in San Francisco, CA on a mission to build human-aligned AI that helps individuals and organizations reach their fullest potential.

Apply to join us! http://jobs.ashbyhq.com/zyphra

will brown@willccbb

occams razor answer to "what's going on here" is grokking imo

bullish for cartridges

will brown@willccbb

feels like the most noteworthy paper about continual learning in quite some time? most other work branded as such is essentially fancy RL-flavored context distillation, this is some real physics of language models shit

Zyphra@ZyphraAI

And it isn't unique to continual learning with nonstationary data.

When we train on a fixed, stationary mixture of all 8 languages, with no task switches, the same decline appears.

This means plasticity loss inevitably occurs even in standard pretraining pipelines.

Beren Millidge@BerenMillidge

Continual learning is a fundamental requirement for AGI. Here we perform the first serious study of a key bottleneck -- loss of plasticity -- in LLMs

We show plasticity loss is a very general phenomenon that follows a mechanistic scaling law. Scaling slows it but cannot stop it

will brown@willccbb

also bullish for logprob-aware decay?

my current pet idea is something like ECHO but with self-generated hint as a logprob filter on env tokens (a la OPSD) to mitigate overtraining, and a core RL objective with some kind of pressure for hints to target useful world models

xlr8harder@xlr8harder

There are a variety of techniques to revive dead neurons in different training settings. I'd really be interested to see how they affect plasticity.

Beren Millidge@BerenMillidge

Plasticity loss is a subtle issue where once trained long enough models start to lose the ability to learn entirely. This is not even forgetting. Models not only lose old information but also fail to learn new data. Clearly, if we cannot learn, we cannot continual-learn

Beren Millidge@BerenMillidge

We find that plasticity loss happens both in nonstationary continual-learning environments and in stationary pretraining-like setups. This means that if you pretrain long enough your LLM will eventually lose the ability to adapt to new data.

Lisan al Gaib@scaling01

the hippocampus is so underrated

Cheng Lou@_chenglou

@scaling01 Have you seen the titans lineage? https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/

Beren Millidge@BerenMillidge

In future work we will obviously try to solve plasticity loss directly, along with catastrophic forgetting, to eventually build models which can continue to grow, learn, and expand forever. This is an early, but important, step along the path.
