Tech · 9h ago

Lior Pachter teases research exposing a major flaw in standard genomic scaling and logarithmic normalization

The pipeline issue impacts datasets used for AIxBio models.

#1077

Original post

Lior Pachter@lpachter

Arguably the most boring step in genomics is the first one: normalization. Settled science. Scale + log. Move on.

Except that there's been a huge blind spot in the field. And it matters for AIxBio. A 🧵 about what I think may be one of the most important papers I've written. 1/

12:44 PM · Jun 10, 2026 · 53.2K Views

Sentiment

Commenters praised Pachter's identification of a normalization blind spot in genomics that affects AIxBio, saying it promotes better compositional data analysis and a rethinking of the field's basics.

Pos: 100.0%

Neg: 0.0%

3 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Lior Pachter@lpachter

The standard normalization is log(x/s*K+1) w/ K=10,000 in Seurat and Scanpy. It's been used in hundreds of thousands of studies. AI agents nowadays run it routinely.

In an expansive benchmark in @naturemethods, Ahlmann-Eltze & Huber conclude it's pretty much best in class. 2/
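The formula quoted above, log(x/s*K+1), can be sketched in a few lines of NumPy. This is only an illustration, not the Seurat or Scanpy code; the function name `log_pf`, the cells-as-rows layout, and the toy counts are assumptions made here, and cells with a total count of zero are not handled.

```python
import numpy as np

def log_pf(counts, K=10_000):
    """Compute log(x / s * K + 1), where s is each cell's total count."""
    counts = np.asarray(counts, dtype=float)
    # Assumed layout: rows are cells, columns are genes.
    s = counts.sum(axis=1, keepdims=True)
    # log1p(v) is log(v + 1), which matches the +1 in the quoted formula.
    return np.log1p(counts / s * K)

# Toy matrix: two cells, three genes (made-up numbers).
counts = np.array([[1, 0, 3],
                   [10, 5, 85]])
normalized = log_pf(counts)
```

Zero counts stay at zero after the transform, which is one reason the +1 shift is popular.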

Lior Pachter@lpachter

There is a ton more so there will be more threads. E.g., we show how important it is to incorporate dataset-specific overdispersion estimates in the shifted CLR pseudocount.

tl;dr: After the standard scale and log transform... center cells to zero! https://www.biorxiv.org/content/10.1101/2022.05.06.490859v3

Lior Pachter@lpachter

First, a link to the preprint describing the method (PFlogPF) that produced the amazing results shown above: https://www.biorxiv.org/content/10.1101/2022.05.06.490859v3 The work was led by @sinabooeshaghi who is first and corresponding author, w/ important contributions by @IngileifBryndis & @agalvezmerchan. 4/

Lior Pachter@lpachter

Except it isn't. Not even close. In a project that is four years in the making, we show that another transformation massively outperforms existing methods on the Ahlmann-Eltze & Huber benchmarks (red dots below). Moreover, it's optimal. What is this new method? How can it be? 3/

Lior Pachter@lpachter

In our preprint we prove a theorem: the only method satisfying rank monotonicity, perturbation additivity (plays well with PCA), relabeling equivariance (input order doesn't matter), depth invariance, and a basic calibration, is CLR. 13/
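Several of the listed properties are easy to check numerically for the plain CLR on strictly positive counts. The sketch below is not the theorem or its proof, only a sanity check on random data. Reading "perturbation additivity" as clr(x * y) = clr(x) + clr(y), with * the element-wise product, is an assumption made here; the function and variable names are also chosen for illustration.

```python
import numpy as np

def clr(x):
    """Centered log-ratio of one cell with strictly positive counts."""
    logged = np.log(x)
    return logged - logged.mean()

rng = np.random.default_rng(1)
x = rng.integers(1, 100, size=8).astype(float)  # positive toy counts
y = rng.integers(1, 100, size=8).astype(float)
perm = rng.permutation(8)

# Depth invariance: doubling every count leaves the result unchanged.
depth_invariant = np.allclose(clr(2 * x), clr(x))

# Relabeling equivariance: permuting genes permutes the output the same way.
relabeling_equivariant = np.allclose(clr(x[perm]), clr(x)[perm])

# Rank monotonicity: the ordering of genes within the cell is preserved.
rank_monotone = np.array_equal(np.argsort(x, kind="stable"),
                               np.argsort(clr(x), kind="stable"))

# Perturbation additivity (assumed reading, see the note above).
perturbation_additive = np.allclose(clr(x * y), clr(x) + clr(y))
```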

Lior Pachter@lpachter

So people benchmarked what they thought was CLR but it was "CLR" (as implemented in Seurat) and they got terrible results. It seems CLR was therefore abandoned and ignored. See, e.g. https://academic.oup.com/bioinformatics/article/38/1/164/6367764 from @drisso1893 and @crmn72. 17/

Lior Pachter@lpachter

This improvement of PFlogPF over the standard logPF is massive. It's rare in bioinformatics to see such a quantum leap. This is a reproduction of another Ahlmann-Eltze & Huber figure with PFlogPF added. PFlogPF preserves ~35 neighbors (out of 50) as opposed to ~5! 20/

Lior Pachter@lpachter

The Seurat team was informed that Seurat "CLR" is problematic (e.g., https://github.com/satijalab/seurat/issues/2624), but that's a story for another thread. BTW Scanpy doesn't have a CLR implementation. So even though CLR is used frequently in, e.g. metagenomics, it is not in single-cell fields. 18/

Lior Pachter@lpachter

Turns out PFlogPF matters for much more than just doing PCA after normalization. It turns out that normalization interacts with gene selection / filtering. Even a small shift in gene selection can distort geometry with standard normalization, a problem that PFlogPF fixes. 21/

Lior Pachter@lpachter

These goals are not controversial. The Ahlmann-Eltze & Huber paper benchmarks methods on variance stabilization and depth normalization. E.g., it subsamples a dataset, runs PCA and builds a kNN graph on the full and subsampled data, and compares them. That's this plot. 11/
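The subsample, PCA, kNN comparison described above can be sketched roughly as follows. This is not the Ahlmann-Eltze & Huber benchmark code: the synthetic gamma-Poisson counts, the binomial thinning at 50%, the 10 components, k = 10, and every function name are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def log_pf(counts, K=10_000):
    # Standard scale + log transform: log(x / s * K + 1), cells as rows.
    s = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / s * K)

def pca(m, n_components):
    # PCA scores via SVD of the gene-centered matrix.
    centered = m - m.mean(axis=0, keepdims=True)
    u, sing, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * sing[:n_components]

def knn(points, k):
    # Brute-force k nearest neighbors (squared Euclidean distance).
    d = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d, np.inf)  # a cell is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

def mean_overlap(a, b):
    # Average number of shared neighbors per cell between two kNN graphs.
    return float(np.mean([len(set(r1) & set(r2)) for r1, r2 in zip(a, b)]))

rng = np.random.default_rng(0)
n_cells, n_genes, k = 200, 100, 10
rates = rng.gamma(shape=2.0, scale=5.0, size=(n_cells, n_genes))
full = rng.poisson(rates).astype(float)
# Subsample each count by keeping every read with probability 0.5.
sub = rng.binomial(full.astype(int), 0.5).astype(float)

nn_full = knn(pca(log_pf(full), 10), k)
nn_sub = knn(pca(log_pf(sub), 10), k)
overlap = mean_overlap(nn_full, nn_sub)  # between 0 and k
```

Swapping `log_pf` for another transform and comparing `overlap` is the kind of comparison the plot summarizes, though real benchmarks use real datasets and many replicates.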

Lior Pachter@lpachter

Wait, hasn't this been done before?! CLR is classic. Compositional analysis is a known method. Well it turns out it was implemented in Seurat, but incorrectly. Details are in our Supplementary Note: https://www.biorxiv.org/content/biorxiv/early/2026/06/09/2022.05.06.490859/DC2/embed/media-2.pdf The Seurat "CLR" is worse than not touching the data. 16/

Lior Pachter@lpachter

What is PFlogPF? It commonly goes by another name: the shifted centered log-ratio transform (CLR). It should also be called PFlogPF for good reason (more on that later). The CLR was introduced by John Aitchison in... wait for it... 1982: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1982.tb01195.x So why isn't it used? 5/
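For reference, the classical CLR of a single cell with strictly positive counts can be written as below. The symbols are notation chosen here, not taken from the thread: x_g is the count of gene g in the cell and G is the number of genes.

```latex
\operatorname{clr}(x)_g
  = \log x_g - \frac{1}{G}\sum_{j=1}^{G} \log x_j
  = \log \frac{x_g}{\left(\prod_{j=1}^{G} x_j\right)^{1/G}},
  \qquad g = 1, \dots, G.
```

In words, each log count is centered by the cell's mean log count, which is the same as dividing by the geometric mean before taking the log. A shifted variant adds a pseudocount to every x_g before the log; the thread does not spell out the preprint's specific pseudocount.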

Lior Pachter@lpachter

We refer to shifted CLR as PFlogPF because it is equivalent to the current normalization method (logPF) followed by another proportional fitting (we derive this in the Supplementary Note). This means that one can take existing normalized count matrices and just center them! 19/
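One way to see why centering an already normalized matrix yields a shifted CLR: log(x/s*K + 1) = log(x + s/K) + log(K/s), and the second term is constant within a cell, so subtracting the cell's mean removes it. The sketch below checks that identity numerically. It shows only this algebraic step with the per-cell pseudocount s/K that falls out of it; the preprint's actual pseudocount (which, per the thread, involves overdispersion estimates) and its derivation may differ, and all names here are illustrative.

```python
import numpy as np

def center_cells(m):
    # Subtract each cell's (row's) mean so every cell sums to zero.
    return m - m.mean(axis=1, keepdims=True)

def shifted_clr(counts, pseudocount):
    # CLR applied to counts shifted by a pseudocount.
    logged = np.log(counts + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(5, 20)).astype(float)  # toy data with zeros
s = counts.sum(axis=1, keepdims=True)                  # per-cell depth
K = 10_000

log_pf = np.log1p(counts / s * K)        # the existing normalized matrix
centered = center_cells(log_pf)          # "just center them"
clr = shifted_clr(counts, s / K)         # shifted CLR with pseudocount s/K

same = np.allclose(centered, clr)
```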

Lior Pachter@lpachter

The theorem is great but isn't practical. It applies only to positive counts. Counts can be zero. So we prove additionally that the shifted CLR is the way to deal with zeroes. This involves a pseudocount related to variance stabilization. The result is the best of all worlds. 14/

Lior Pachter@lpachter

Well it is! And it isn't! WHAT?!?!

Before we get into explaining these contradictory statements, let's talk for a minute about why normalization is performed in the first place, and what it should accomplish. There are three basic requirements: 6/

Lior Pachter@lpachter

1. Stabilize variance.

In single-cell genomics the variance of counts of, say, genes across cells increases with the mean expression. This is bad because downstream procedures such as PCA will be insensitive to lowly expressed genes. Variance stabilization fixes that. 7/
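A small simulation makes the mean-variance relationship concrete. The sketch assumes a gamma-Poisson (negative binomial) count model with a shared overdispersion of 0.5 and three made-up gene means; none of these numbers come from the thread. It compares how widely the per-gene variances are spread before and after a log(x + 1) transform.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([5.0, 50.0, 500.0])  # hypothetical low, mid, high genes
alpha = 0.5                           # assumed overdispersion
n = 1.0 / alpha                       # negative binomial "size" parameter
p = n / (n + means)                   # chosen so each gene's mean equals `means`

# 5000 simulated cells, one column per gene.
counts = rng.negative_binomial(n, p, size=(5000, means.size))

raw_var = counts.var(axis=0)              # grows quickly with the mean
log_var = np.log1p(counts).var(axis=0)    # much closer to constant

# Ratio of largest to smallest per-gene variance, before and after.
spread_before = raw_var.max() / raw_var.min()
spread_after = log_var.max() / log_var.min()
```

With raw counts the highly expressed gene dominates any variance-based method such as PCA; after the transform the three genes contribute on a comparable scale.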

Lior Pachter@lpachter

On the theory side, we also show that not only is PFlogPF (shifted CLR) unique given the desiderata, but also that all the axioms are necessary. This turns out to be interesting in its own right, because to show monotonicity is necessary we use... the axiom of choice (!) 22/

Lior Pachter@lpachter

2. Eliminate depth dependency.

If you sequence twice as much you get double the number of counts. Normalization should ensure that results of a study are not sensitive to this technical artifact. In other words, transformation should facilitate compositional analysis. 8/

Lior Pachter@lpachter

Getting these three things is non-trivial. E.g. to variance stabilize you could just scale all variances of every gene to be equal (perfect!), but then you lose monotonicity. Depth normalization per cell makes everything compositional. But doesn't play well with variance. 10/
