Elana Simon finds that, despite auxiliary reconstruction loss mitigations, over 70% of sparse autoencoder features can remain dead

Anthropic's Liv Gorton highlighted the research for mechanistic interpretability

Elana Simon (@ElanaPearl)

The hardest hit examples were mostly bio models and some vision ones. It appears to be modality-specific, but the underlying common thread is that the highest-death layers all have a few activation dimensions with *huge* outlier values.
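One way to look for such dimensions is sketched below. It is only an illustration on synthetic data: the function name `find_outlier_dims`, the `factor` threshold, and the injected offsets are assumptions, not taken from the paper or its repository.

```python
import numpy as np

def find_outlier_dims(acts, factor=20.0):
    """Return indices of dimensions whose typical magnitude dwarfs the rest.

    acts:   array of shape (n_samples, d_model).
    factor: hypothetical threshold, in multiples of the median magnitude.
    """
    # Root-mean-square magnitude of each activation dimension.
    rms = np.sqrt(np.mean(acts ** 2, axis=0))
    # Flag dimensions that sit far above the median dimension.
    return np.flatnonzero(rms > factor * np.median(rms))

rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 64))
acts[:, [3, 17]] += 100.0  # inject two synthetic outlier dimensions

outliers = find_outlier_dims(acts)
print(outliers)
```

On this synthetic input the check flags dimensions 3 and 17, the two that were given a large offset.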

Elana Simon (@ElanaPearl)

In a few layers, nearly all variance sits in a handful of directions, so alignment with them dictates which features win Top-K. https://arxiv.org/abs/2508.16929 saw and fixed a milder version in attention outputs. Our data is more extreme, so we need to PCA-whiten to normalize the variance and revive the dead features.
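PCA whitening in general can be sketched as follows. This is the textbook procedure on synthetic data, not necessarily the exact preprocessing used in the paper; `pca_whiten`, `eps`, and the synthetic scales are illustrative assumptions.

```python
import numpy as np

def pca_whiten(acts, eps=1e-6):
    """Center the data, rotate onto principal axes, rescale each to unit variance."""
    mean = acts.mean(axis=0, keepdims=True)
    centered = acts - mean
    # Sample covariance matrix (d_model x d_model).
    cov = centered.T @ centered / (len(acts) - 1)
    # eigh is appropriate because the covariance is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Project onto the eigenvectors and divide by each direction's std.
    whitened = (centered @ eigvecs) / np.sqrt(eigvals + eps)
    return whitened, mean, eigvecs, eigvals

rng = np.random.default_rng(0)
scales = np.ones(16)
scales[:2] = 200.0  # two directions carry almost all of the variance
acts = rng.normal(size=(5000, 16)) * scales + 5.0

white, mean, eigvecs, eigvals = pca_whiten(acts)
cov_white = np.cov(white, rowvar=False)
print(np.round(np.diag(cov_white), 3))
```

After whitening, every direction has roughly unit variance, so no small set of directions can decide the Top-K ranking by scale alone.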

Elana Simon (@ElanaPearl)

You can derive this death rate analytically — it just depends on how outlier-heavy the model's activations are (the size of the mean vs. the spread). The prediction matches both synthetic data with injected outliers and real models, across hundreds of layers from ~20 models
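A toy version of that dependence, which is not the paper's derivation, can be written down under strong assumptions: zero encoder bias, a unit-norm encoder direction $w$ drawn uniformly at random in $d$ dimensions, and activations $x = \mu + \sigma z$ with $z \sim \mathcal{N}(0, I_d)$. All symbols here, including the near-dead threshold $\varepsilon$, are introduced for illustration only.

```latex
\begin{align*}
w^\top x &\sim \mathcal{N}\!\left(w^\top \mu,\; \sigma^2\right), \\
p_{\text{fire}}(w) &= \Pr\!\left[w^\top x > 0\right]
   = \Phi\!\left(\frac{w^\top \mu}{\sigma}\right), \\
w^\top \mu &\approx \mathcal{N}\!\left(0,\; \frac{\lVert \mu \rVert^2}{d}\right)
   \quad \text{(large } d\text{)}, \\
\Pr_w\!\left[\,p_{\text{fire}}(w) < \varepsilon\,\right]
   &\approx \Phi\!\left(\frac{\Phi^{-1}(\varepsilon)}{r}\right),
   \qquad r = \frac{\lVert \mu \rVert}{\sigma \sqrt{d}}.
\end{align*}
```

In this toy setting the fraction of near-dead ReLU features depends only on $r$, the size of the mean relative to the spread. For $\varepsilon < 1/2$ it tends to 0 as $r \to 0$ and approaches $1/2$ as $r \to \infty$.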

Elana Simon (@ElanaPearl)

A feature pointing away from the mean gets a negative value that the ReLU zeros out, and it never fires. In Top-K SAEs, those same outliers rig the competition: the most-aligned features win every round (always on) and the least-aligned never make the cut (always dead).
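Both effects can be reproduced in a small simulation at random initialization. This is a sketch on synthetic Gaussian data with two injected outlier dimensions; the sizes `n`, `d`, `m`, `k`, the zero bias, and the helper names are assumptions for illustration rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, k = 2000, 64, 512, 16  # samples, input dim, SAE features, Top-K

def make_acts(outlier_size):
    """Unit-variance Gaussian activations with a large offset in two dimensions."""
    acts = rng.normal(size=(n, d))
    acts[:, :2] += outlier_size
    return acts

# Randomly initialized encoder with unit-norm feature directions and zero bias.
W = rng.normal(size=(d, m))
W /= np.linalg.norm(W, axis=0, keepdims=True)

def relu_dead_fraction(acts):
    """Fraction of features whose pre-activation is never positive."""
    pre = acts @ W
    return np.mean(~(pre > 0).any(axis=0))

def topk_usage(acts):
    """Fractions of features that are never / always among the k largest."""
    pre = acts @ W
    # Indices of the k largest pre-activations for each sample.
    top = np.argpartition(-pre, k - 1, axis=1)[:, :k]
    counts = np.bincount(top.ravel(), minlength=m)
    return np.mean(counts == 0), np.mean(counts == n)

clean, outlier = make_acts(0.0), make_acts(100.0)
relu_clean, relu_out = relu_dead_fraction(clean), relu_dead_fraction(outlier)
(dead_clean, on_clean), (dead_out, on_out) = topk_usage(clean), topk_usage(outlier)
print("ReLU dead:", relu_clean, relu_out)
print("Top-K dead / always-on:", (dead_clean, on_clean), (dead_out, on_out))
```

Without the offset, every feature fires at least occasionally under both rules. With it, a large share of ReLU features never fire, and under Top-K the same few well-aligned features win while most of the rest are never selected.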

Elana Simon (@ElanaPearl)

We first guessed the outliers were disrupting training by dominating the reconstruction loss, but the death starts *before* training. Most features are born dead: a feature's activation isn't based on the input, but rather on how its random init state aligns with the outliers.
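The claim that initialization alone decides a feature's fate can be checked on synthetic data, with no training involved. The sketch below assumes a zero-bias ReLU encoder with random unit-norm directions and Gaussian activations with two offset dimensions; all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 64, 512

acts = rng.normal(size=(n, d))
acts[:, :2] += 100.0  # synthetic outlier dimensions

# Untrained encoder: random unit-norm directions, zero bias.
W = rng.normal(size=(d, m))
W /= np.linalg.norm(W, axis=0, keepdims=True)

# Cosine alignment of each feature direction with the data mean.
mean = acts.mean(axis=0)
alignment = (mean / np.linalg.norm(mean)) @ W

# How often each feature fires across the dataset.
fire_rate = (acts @ W > 0).mean(axis=0)

# Does the sign of the alignment alone predict "mostly on" vs "mostly off"?
agreement = np.mean((alignment > 0) == (fire_rate > 0.5))
print(agreement)
```

In this toy setting the sign of a feature's alignment with the mean predicts whether it is mostly on or mostly off for nearly every feature, regardless of the individual inputs.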

Elana Simon (@ElanaPearl)

Training can fix this, but very slowly — features don't fully revive until the model learns to center the data, often taking millions of steps. Centering from the start skips all that: less death, easier training, better features. It works for *almost* every layer we tested...
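The effect of centering at initialization can be illustrated on synthetic data. This sketch subtracts the dataset mean before a random zero-bias ReLU encoder; whether the paper centers in exactly this way is an assumption, and the helper name and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 64, 512

acts = rng.normal(size=(n, d))
acts[:, :2] += 100.0  # synthetic outlier dimensions

# Random unit-norm encoder directions, zero bias.
W = rng.normal(size=(d, m))
W /= np.linalg.norm(W, axis=0, keepdims=True)

def relu_dead_fraction(x):
    """Fraction of features that never produce a positive pre-activation."""
    return np.mean(~((x @ W) > 0).any(axis=0))

# Center the activations with the dataset mean before encoding.
centered = acts - acts.mean(axis=0, keepdims=True)

dead_raw = relu_dead_fraction(acts)
dead_centered = relu_dead_fraction(centered)
print(dead_raw, dead_centered)
```

On this synthetic example, centering removes the offset that kept many features permanently negative, so no feature is dead at initialization.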

Elana Simon (@ElanaPearl)

There's a lot more in our ICML 2026 paper (with @etowah0 & @james_y_zou!) — the full theory, analysis on how features recover/die in training, and results across language, vision, protein, and genomic models. 📝: https://arxiv.org/abs/2605.31518 💻: https://github.com/ElanaPearl/sae-feature-death
