
The hardest-hit examples were mostly bio models, plus some vision ones. It looks modality-specific, but the common thread is that the highest-death layers all have a few activation dimensions with *huge* outlier values.
Anthropic's Liv Gorton highlighted the research for mechanistic interpretability

In a few layers, nearly all the variance sits in a handful of directions, so alignment with those directions dictates who wins Top-K. Prior work (https://arxiv.org/abs/2508.16929) saw and fixed a milder version of this in attention outputs. Our data is more extreme, so we need to PCA-whiten to normalize the variance & revive the dead.
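
For anyone who wants to see what PCA whitening does here, a minimal NumPy sketch follows. The function name `pca_whiten`, the `eps` floor, and the synthetic data with two inflated dimensions are all illustrative assumptions, not code from the paper or its repo.

```python
import numpy as np

def pca_whiten(x, eps=1e-5):
    """Center x, rotate onto its principal axes, and rescale each axis to unit variance."""
    mean = x.mean(axis=0)
    xc = x - mean
    cov = xc.T @ xc / (len(x) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    scale = 1.0 / np.sqrt(eigvals + eps)     # eps guards against near-zero variance
    return (xc @ eigvecs) * scale, eigvals

# Synthetic stand-in for a layer's activations (assumption): 16 dims, two of them outliers.
rng = np.random.default_rng(0)
x = rng.normal(size=(5000, 16))
x[:, 0] = x[:, 0] * 50.0 + 200.0   # huge mean and huge spread
x[:, 1] *= 30.0                    # huge spread only

z, eigvals = pca_whiten(x)

# Share of total variance held by the two largest principal directions before whitening.
top2_share = eigvals[-2:].sum() / eigvals.sum()
# After whitening, the covariance should be close to the identity.
cov_z = np.cov(z, rowvar=False)
print(f"top-2 variance share before: {top2_share:.4f}")
print(f"max |cov - I| after: {np.abs(cov_z - np.eye(16)).max():.2e}")
```

After whitening, no single direction carries more variance than any other, so no small set of encoder directions gets a built-in head start.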

You can derive this death rate analytically: it just depends on how outlier-heavy the model's activations are (the size of the mean vs. the spread). The prediction matches both synthetic data with injected outliers and real models, across hundreds of layers from ~20 models.
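
To give a feel for how such a prediction can work, here is a toy version, not the paper's derivation. Assume activations are x = mean + sigma * noise with isotropic Gaussian noise, and a zero-bias ReLU encoder with random unit-norm directions w. Then a = w · mean is roughly Gaussian with standard deviation ||mean|| / sqrt(d), and a feature never fires on N samples with probability Phi(-a / sigma)^N. All names and parameter values below are hypothetical.

```python
import numpy as np
from math import erf, sqrt

def phi(t):
    """Standard normal CDF, vectorised over arrays via math.erf."""
    return 0.5 * (1.0 + np.vectorize(erf)(np.asarray(t, dtype=float) / sqrt(2.0)))

def predicted_dead_fraction(mean_norm, sigma, d, n_samples):
    """Toy prediction: average Phi(-a/sigma)^N over a ~ N(0, mean_norm^2 / d)."""
    s = mean_norm / sqrt(d)
    a = np.linspace(-8 * s, 8 * s, 20001)
    density = np.exp(-0.5 * (a / s) ** 2) / (s * sqrt(2 * np.pi))
    p_never_fires = phi(-a / sigma) ** n_samples
    return float(np.sum(density * p_never_fires) * (a[1] - a[0]))

def simulated_dead_fraction(mean_norm, sigma, d, n_samples, n_features, seed=0):
    """Fraction of randomly initialised ReLU features that never fire on synthetic data."""
    rng = np.random.default_rng(seed)
    x = sigma * rng.normal(size=(n_samples, d))
    x[:, 0] += mean_norm                      # put the whole mean in one outlier dimension
    W = rng.normal(size=(n_features, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    fires = (x @ W.T) > 0                     # ReLU output is nonzero iff pre-activation > 0
    return float(np.mean(~fires.any(axis=0)))

pred = predicted_dead_fraction(100.0, 1.0, 64, 2000)
sim = simulated_dead_fraction(100.0, 1.0, 64, 2000, 2000)
print(f"toy prediction: {pred:.3f}, simulation: {sim:.3f}")
```

In this toy setup the only quantity that matters is the ratio of the mean's size to the spread, which is the same dependence described above.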

In ReLU SAEs, a feature pointing away from the mean gets a negative pre-activation that the ReLU zeros out, so it never fires. In Top-K SAEs, those same outliers rig the competition: the most-aligned features win every round (always on) and the least-aligned never make the cut (always dead).
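
The rigged Top-K competition is easy to reproduce on toy data. The sketch below assumes an untrained encoder with random unit-norm directions and synthetic activations with one outlier dimension. The sizes and the outlier magnitude are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features, n_samples, k = 64, 256, 2000, 8

# Toy activations (assumption): unit-variance noise plus one huge outlier dimension.
x = rng.normal(size=(n_samples, d))
x[:, 0] += 100.0

# Random unit-norm encoder directions, standing in for an SAE at initialisation.
W = rng.normal(size=(n_features, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

pre = x @ W.T                                            # (n_samples, n_features)
winners = np.argpartition(-pre, k - 1, axis=1)[:, :k]    # indices of the k largest per sample
fire_counts = np.bincount(winners.ravel(), minlength=n_features)

dead_frac = float(np.mean(fire_counts == 0))
# Share of all Top-K slots taken by the k most frequently selected features.
top_share = float(np.sort(fire_counts)[-k:].sum() / (n_samples * k))
# Alignment of each feature with the outlier direction (here, coordinate 0).
alignment = W[:, 0]

print(f"dead features: {dead_frac:.1%}")
print(f"slots held by the {k} most frequent features: {top_share:.1%}")
print(f"min alignment among features that ever fire: {alignment[fire_counts > 0].min():.3f}")
```

Which features win is decided almost entirely by `alignment`, a property of the random init, rather than by anything in the individual inputs.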

We first guessed the outliers were disrupting training by dominating the reconstruction loss, but the death starts *before* training. Most features are born dead: a feature's activation isn't based on the input, but rather on how its random init state aligns with the outliers.

Training can fix this, but very slowly — features don't fully revive until the model learns to center the data, often taking millions of steps. Centering from the start skips all that: less death, easier training, better features. It works for *almost* every layer we tested...
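
A rough illustration of why centering helps at initialisation, assuming "centering" means subtracting the empirical mean before the encoder. The `pre_bias` name and the synthetic data are hypothetical, and the paper's exact procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features, n_samples = 64, 512, 2000

# Toy activations (assumption): unit-variance noise plus one huge outlier dimension.
x = rng.normal(size=(n_samples, d))
x[:, 0] += 100.0

# Random unit-norm encoder directions, zero encoder bias.
W = rng.normal(size=(n_features, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

def relu_dead_fraction(inputs, W):
    """Fraction of features whose ReLU output is zero on every input."""
    acts = np.maximum(inputs @ W.T, 0.0)
    return float(np.mean(acts.max(axis=0) == 0.0))

dead_raw = relu_dead_fraction(x, W)

pre_bias = x.mean(axis=0)                    # hypothetical pre-encoder bias set to the data mean
dead_centered = relu_dead_fraction(x - pre_bias, W)

print(f"dead at init, raw inputs:      {dead_raw:.1%}")
print(f"dead at init, centered inputs: {dead_centered:.1%}")
```

Once the mean is subtracted, each feature's pre-activations sum to zero over the dataset, so every feature has at least one positive pre-activation and none start out dead in this toy setting.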

There's a lot more in our ICML 2026 paper (with @etowah0 & @james_y_zou!) — the full theory, analysis on how features recover/die in training, and results across language, vision, protein, and genomic models. 📝: https://arxiv.org/abs/2605.31518 💻: https://github.com/ElanaPearl/sae-feature-death