
Lior Pachter teases research exposing a major flaw in standard genomic scaling and logarithmic normalization

The pipeline issue impacts datasets used for AIxBio models.

Original post
Lior Pachter@lpachter

Arguably the most boring step in genomics is the first one: normalization. Settled science. Scale + log. Move on.

Except that there's been a huge blind spot in the field. And it matters for AIxBio. A 🧵about what I think may be one of the most important papers I've written. 1/

12:44 PM · Jun 10, 2026 · 53.2K Views
Sentiment

Many users praised Pachter's identification of a normalization blind spot in genomics that affects AIxBio, saying it promotes better compositional data analysis techniques and a rethinking of basics in the field.

Positive: 100.0% · Negative: 0.0% (3 comments with sentiment)
Posts from X
Lior Pachter@lpachter

The standard normalization is log(x/s*K+1) w/ K=10,000 in Seurat and Scanpy. It's been used in hundreds of thousands of studies. AI agents nowadays run it routinely.

In an expansive benchmark in @naturemethods, Ahlmann-Eltze & Huber conclude it's pretty much best in class. 2/
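The formula in this post, log(x/s*K+1), can be sketched in a few lines of NumPy. This is a minimal illustration and not the Seurat or Scanpy code: the function name `log_pf` and the toy matrix are made up here, `s` is taken to be each cell's total count, and every cell is assumed to have at least one count.

```python
import numpy as np

def log_pf(counts, K=10_000):
    """Scale each cell (row) to K total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)  # s: total counts per cell
    return np.log1p(counts / depth * K)

# Toy cells-by-genes count matrix (made-up numbers for illustration).
X = np.array([[10, 0, 90],
              [1, 4, 5]])
Y = log_pf(X)
```

Zero counts stay at zero after the transform, which is the role of the +1 inside the logarithm.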

Lior Pachter@lpachter

There is a ton more so there will be more threads. E.g., we show how important it is to incorporate dataset specific overdispersion estimates in the shifted CLR pseudocount.

tl;dr: After the standard scale and log transform, center cells to zero! https://www.biorxiv.org/content/10.1101/2022.05.06.490859v3

Lior Pachter@lpachter

First, a link to the preprint describing the method (PFlogPF) that produced the amazing results shown above: https://www.biorxiv.org/content/10.1101/2022.05.06.490859v3 The work was led by @sinabooeshaghi who is first and corresponding author, w/ important contributions by @IngileifBryndis & @agalvezmerchan. 4/

Lior Pachter@lpachter

Except it isn't. Not even close. In a project that is four years in the making, we show that another transformation massively outperforms existing methods on the Ahlmann-Eltze & Huber benchmarks (red dots below). Moreover, it's optimal. What is this new method? How can it be? 3/

Lior Pachter@lpachter

In our preprint we prove a theorem: the only method satisfying rank monotonicity, perturbation additivity (plays well with PCA), relabeling equivariance (input order doesn't matter), depth invariance, and a basic calibration, is CLR. 13/

Lior Pachter@lpachter

So people benchmarked what they thought was CLR but it was "CLR" (as implemented in Seurat) and they got terrible results. It seems CLR was therefore abandoned and ignored. See, e.g. https://academic.oup.com/bioinformatics/article/38/1/164/6367764 from @drisso1893 and @crmn72. 17/

Lior Pachter@lpachter

This improvement of PFlogPF over the standard logPF is massive. It's rare in bioinformatics to see such a quantum leap. This is a reproduction of another Ahlmann-Eltze & Huber figure with PFlogPF added. PFlogPF preserves ~35 neighbors (out of 50) as opposed to ~5! 20/

Lior Pachter@lpachter

The Seurat team was informed that Seurat "CLR" is problematic (e.g., https://github.com/satijalab/seurat/issues/2624), but that's a story for another thread. BTW Scanpy doesn't have a CLR implementation. So even though CLR is used frequently in, e.g. metagenomics, it is not in single-cell fields. 18/

Lior Pachter@lpachter

Turns out PFlogPF matters for much more than just doing PCA after normalization. It turns out that normalization interacts with gene selection / filtering. Even a small shift in gene selection can distort geometry with standard normalization, a problem that PFlogPF fixes. 21/

Lior Pachter@lpachter

These goals are not controversial. The Ahlmann-Eltze & Huber paper benchmarks methods on variance stabilization and depth normalization. E.g., it subsamples a dataset, runs PCA and builds a kNN graph on the full and subsampled data, and compares them. That's this plot. 11/
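The benchmark idea described in this post (subsample, run PCA, build kNN graphs, compare) can be sketched roughly as follows. This is not the Ahlmann-Eltze & Huber code: the simulated counts, the 50% binomial thinning, 10 principal components, k = 10, and all function names are assumptions chosen only to keep the example small.

```python
import numpy as np

def log_pf(counts, K=10_000):
    """Standard scale + log normalization (assumes no empty cells)."""
    counts = np.asarray(counts, dtype=float)
    return np.log1p(counts / counts.sum(axis=1, keepdims=True) * K)

def pca(Y, n_components):
    """Project gene-centered data onto its top principal components."""
    Yc = Y - Y.mean(axis=0)
    U, S, _ = np.linalg.svd(Yc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def knn(Z, k):
    """Indices of the k nearest neighbors of each row (brute force)."""
    sq = (Z ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
    np.fill_diagonal(d2, np.inf)  # a cell is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def mean_knn_overlap(a, b):
    """Average number of neighbors shared between two kNN index arrays."""
    return float(np.mean([len(set(r) & set(s)) for r, s in zip(a, b)]))

rng = np.random.default_rng(0)
full = rng.poisson(rng.gamma(2.0, 2.0, size=(200, 50)))  # simulated counts
sub = rng.binomial(full, 0.5)  # keep each count with probability 0.5

k = 10
nn_full = knn(pca(log_pf(full), 10), k)
nn_sub = knn(pca(log_pf(sub), 10), k)
overlap = mean_knn_overlap(nn_full, nn_sub)  # between 0 and k
```

A transformation that is less sensitive to depth should keep `overlap` closer to `k`; swapping `log_pf` for another transformation is how alternatives would be compared under this setup.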

Lior Pachter@lpachter

Wait, hasn't this been done before?! CLR is classic. Compositional analysis is a known method. Well it turns out it was implemented in Seurat, but incorrectly. Details are in our Supplementary Note: https://www.biorxiv.org/content/biorxiv/early/2026/06/09/2022.05.06.490859/DC2/embed/media-2.pdf The Seurat "CLR" is worse than not touching the data. 16/

Lior Pachter@lpachter

What is PFlogPF? It commonly goes by another name: the shifted centered log-ratio transform (CLR). It should also be called PFlogPF for good reason (more on that later). The CLR was introduced by John Aitchison in... wait for it... 1982: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1982.tb01195.x So why isn't it used? 5/
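For readers who have not met it, the textbook (unshifted) CLR takes the log of each component and subtracts the mean of the logs. A minimal sketch, assuming strictly positive counts and a made-up four-gene cell; the function name `clr` is illustrative:

```python
import numpy as np

def clr(x):
    """Centered log-ratio of strictly positive values, applied per row."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean(axis=-1, keepdims=True)

cell = np.array([4.0, 2.0, 1.0, 8.0])  # toy counts for one cell
z = clr(cell)
z_double = clr(2 * cell)  # the same cell at exactly twice the depth
```

The transformed values sum to zero, and `z_double` equals `z`, because multiplying every count by a constant only shifts the logs by a constant that the centering removes.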

Lior Pachter@lpachter

We refer to shifted CLR as PFlogPF because it is equivalent to the current normalization method (logPF) followed by another proportional fitting (we derive this in the Supplementary Note). This means that one can take existing normalized count matrices and just center them! 19/
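A rough sketch of the "just center them" idea, reading centering as subtracting each cell's mean from its log-normalized values. This is an interpretation of the thread, not the preprint's implementation. For the plain log(x/s*K+1) formula, the algebra gives a shifted CLR with a per-cell pseudocount of s/K; the preprint's pseudocount (which the thread says involves overdispersion estimates) may differ. All names and the toy matrix are hypothetical.

```python
import numpy as np

def log_pf(counts, K=10_000):
    """Standard scale + log normalization (assumes no empty cells)."""
    counts = np.asarray(counts, dtype=float)
    return np.log1p(counts / counts.sum(axis=1, keepdims=True) * K)

def center_cells(log_matrix):
    """Subtract each cell's mean so that every row averages to zero."""
    return log_matrix - log_matrix.mean(axis=1, keepdims=True)

def shifted_clr(counts, pseudocount):
    """CLR applied to counts + pseudocount (pseudocount may vary per cell)."""
    logs = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return logs - logs.mean(axis=1, keepdims=True)

X = np.array([[10, 0, 90, 3],
              [1, 4, 5, 0]])  # toy counts, zeros included
K = 10_000
depth = X.sum(axis=1, keepdims=True)

centered = center_cells(log_pf(X, K))
via_clr = shifted_clr(X, depth / K)  # agrees with `centered` up to rounding
```

The agreement follows from log(1 + xK/s) = log(x + s/K) + log(K/s): the second term is constant within a cell, so centering cancels it.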

Lior Pachter@lpachter

The theorem is great but isn't practical. It applies only to positive counts. Counts can be zero. So we prove additionally that the shifted CLR is the way to deal with zeroes. This involves a pseudocount related to variance stabilization. The result is the best of all worlds. 14/

Lior Pachter@lpachter

Well it is! And it isn't! WHAT?!?!

Before we get into explaining these contradictory statements, let's talk for a minute about why normalization is performed in the first place, and what it should accomplish. There are three basic requirements: 6/

Lior Pachter@lpachter

1. Stabilize variance.

In single-cell genomics the variance of counts of, say, genes across cells increases with the mean expression. This is bad because downstream procedures such as PCA will be insensitive to lowly expressed genes. Variance stabilization fixes that. 7/
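The mean-variance relationship can be seen in a small simulation. This sketch assumes counts follow a negative binomial (gamma-Poisson) model with a made-up shape parameter `r = 2` and three hypothetical genes with means 5, 50, and 500; real data will differ.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([5.0, 50.0, 500.0])  # three hypothetical genes
r = 2.0                               # assumed shape; variance = mean + mean**2 / r

# NumPy parameterizes the negative binomial by (n, p); p = r / (r + mean)
# yields the requested mean for each gene.
counts = rng.negative_binomial(r, r / (r + means), size=(20_000, 3))

raw_var = counts.var(axis=0)            # grows steeply with the mean
log_var = np.log1p(counts).var(axis=0)  # roughly comparable across genes
```

On the raw scale the highly expressed gene dominates the total variance, which is why PCA on untransformed counts mostly tracks such genes; after the log transform the three variances are on a similar scale.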

Lior Pachter@lpachter

On the theory side, we also show that not only is PFlogPF (shifted CLR) unique given the desiderata, but also that all the axioms are necessary. This turns out to be interesting in its own right, because to show monotonicity is necessary we use... the axiom of choice (!) 22/

Lior Pachter@lpachter

2. Eliminate depth dependency.

If you sequence twice as much you get double the number of counts. Normalization should ensure that results of a study are not sensitive to this technical artifact. In other words, transformation should facilitate compositional analysis. 8/

Lior Pachter@lpachter

Getting these three things is non-trivial. E.g. to variance stabilize you could just scale all variances of every gene to be equal (perfect!), but then you lose monotonicity. Depth normalization per cell makes everything compositional. But doesn't play well with variance. 10/
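The trade-off in this post can be seen on a toy example, reading "monotonicity" as preserving the ordering of genes within a cell (an interpretation, since the thread does not spell it out). The two-gene matrix is made up for illustration.

```python
import numpy as np

# Rows are cells, columns are genes A and B (toy numbers).
X = np.array([[10.0, 5.0],
              [20.0, 6.0],
              [30.0, 7.0]])

# Force every gene to unit variance by dividing by its standard deviation.
scaled = X / X.std(axis=0)

raw_order = X[0, 0] > X[0, 1]               # True: A is higher than B in cell 0
scaled_order = scaled[0, 0] > scaled[0, 1]  # False: the ranking has flipped
```

Gene B varies little across cells, so dividing by its small standard deviation inflates it past gene A, even though A has twice as many counts in that cell.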
