
scGPT developer Bo Wang highlights analysis warning that improper single-cell genomics normalization distorts downstream biological AI models

Without variance stabilization, high-mean genes disproportionately dominate PCA.
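
A minimal sketch of why this happens, under stated assumptions: counts are simulated as independent Poisson draws with hypothetical gene means, and the square root is used only as a textbook variance-stabilizing transform for Poisson data (it is not one of the methods discussed in the thread). PCA on unscaled data follows total variance, so the share of variance held by the highest-mean gene is a rough proxy for how much that gene steers the leading components.

```python
import numpy as np

def top_gene_variance_share(X):
    """Fraction of the summed per-gene variance contributed by the highest-mean gene."""
    var = X.var(axis=0)
    return float(var[np.argmax(X.mean(axis=0))] / var.sum())

rng = np.random.default_rng(0)
means = np.array([5.0, 50.0, 5000.0])  # hypothetical gene means, chosen for illustration
counts = rng.poisson(means, size=(5000, 3)).astype(float)  # cells x genes

raw_share = top_gene_variance_share(counts)           # Poisson variance grows with the mean
vst_share = top_gene_variance_share(np.sqrt(counts))  # sqrt roughly equalizes the variances
print(f"raw: {raw_share:.2f}, sqrt-transformed: {vst_share:.2f}")
```

On raw counts nearly all of the variance sits in the high-mean gene; after the square root the three genes contribute on a comparable scale.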

sina@sinabooeshaghi

4. These properties are well established and accepted by the field for evaluating a "good" normalization method. Both the sctransform paper and Ahlmann-Eltze & Huber (AE&H) benchmark methods against variance stabilization and depth normalization.

sina@sinabooeshaghi

12. The point is that AI isn't thinking deeply. It's not reading the literature, developing reasonable evaluation criteria, or benchmarking normalization methods against them. It repeats the field's default and confidently justifies it. In this case, we know the answer. But what happens when we don't?

sina@sinabooeshaghi

9. Strangely enough, when you ask Claude Code or ChatGPT which normalization method to use, they recommend sctransform (ChatGPT) or the shifted log (Claude Code), the latter being the method favored by the AE&H benchmark. Ask why, and they say these methods satisfy the criteria listed above. But they don't!

sina@sinabooeshaghi

⤴️ Top of the thread

sina@sinabooeshaghi

8. The clearest benchmark result is downsampling (introduced by AE&H, Nature Methods 2023). Take a deeply sequenced dataset, downsample it, and count how many of each cell's 50 nearest neighbors survive. PFlogPF keeps 36.8. The other methods keep about 5.8.
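
The downsampling test described above can be sketched roughly as follows. This is an illustration under assumptions, not the benchmark's actual code: downsampling is done by binomial thinning, neighbors are found by brute-force Euclidean distance directly on the normalized matrix (no PCA step), and the function names are hypothetical.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of each row's k nearest neighbors (Euclidean), excluding the row itself."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                      # a cell is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def downsample(counts, frac, rng):
    """Binomial thinning: keep each observed molecule independently with probability frac."""
    return rng.binomial(counts, frac)

def mean_knn_overlap(full, down, normalize, k=50):
    """Average number of each cell's k neighbors shared between full and downsampled data."""
    nn_full = knn_indices(normalize(full), k)
    nn_down = knn_indices(normalize(down), k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_full, nn_down)]
    return float(np.mean(shared))

rng = np.random.default_rng(0)
counts = rng.poisson(lam=5.0, size=(200, 100))  # synthetic cells x genes, for illustration
down = downsample(counts, frac=0.2, rng=rng)
overlap = mean_knn_overlap(counts, down, np.log1p, k=50)
print(f"mean shared neighbors out of 50: {overlap:.1f}")
```

Swapping `np.log1p` for another normalization function gives a like-for-like comparison on the same downsampled data; a value closer to `k` means the neighborhood structure is more robust to sequencing depth.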

sina@sinabooeshaghi

5. The "monotonicity" requirement has been underdiscussed in the literature but is important. A method that is not monotone scrambles the ordering of genes within a cell (and cell type), making it hard to compare any two genes.
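
One way to make the monotonicity property concrete is a within-cell order check, sketched below. The helper name is hypothetical, and the per-gene z-score is only a simple stand-in for a gene-wise rescaling; it is not sctransform.

```python
import numpy as np

def is_monotone_within_cells(counts, normalized, tol=1e-12):
    """True if, in every cell (row), genes sorted by raw count have non-decreasing normalized values."""
    order = np.argsort(counts, axis=1, kind="stable")
    sorted_norm = np.take_along_axis(normalized, order, axis=1)
    return bool(np.all(np.diff(sorted_norm, axis=1) >= -tol))

counts = np.array([[5.0, 10.0],
                   [1.0, 100.0]])  # 2 cells x 2 genes, toy values

log_values = np.log1p(counts)                                # same function applied to every entry
zscores = (counts - counts.mean(axis=0)) / counts.std(axis=0)  # rescales each gene separately

print(is_monotone_within_cells(counts, log_values))  # order preserved
print(is_monotone_within_cells(counts, zscores))     # order scrambled in cell 0
```

In cell 0 the raw counts are 5 < 10, but the z-scores are 1 and -1, so the two genes swap order after the gene-wise rescaling.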

sina@sinabooeshaghi

13. In conclusion, if you want a scRNAseq normalization method to best satisfy
- depth norm
- variance stabilization
- monotonicity

Run PFlogPF (package coming soon).

The code is available here: http://github.com/pachterlab/BHGP_2022

The manuscript is available here: https://www.biorxiv.org/content/10.1101/2022.05.06.490859v3

sina@sinabooeshaghi

6. These three (practical) metrics correspond to mathematical axioms that a normalization method must satisfy. In our supplementary note, we prove that these axioms produce a unique normalization method for single-cell RNA-seq data (PFlogPF), also known as the shifted CLR.

sina@sinabooeshaghi

11. And the method that does satisfy all three isn't new. It's the centered log-ratio, from 1982! This transform has been available for 40+ years, passed over in hundreds of thousands of scRNAseq studies for methods that perform poorly with respect to these desiderata.

sina@sinabooeshaghi

7. This theorem, plus a large-scale benchmark on 526 datasets, convinced us PFlogPF best satisfies these desiderata in practice compared to other methods.

sina@sinabooeshaghi

10. Each method fails one of the three. sctransform is not monotone (it scrambles within-cell gene order). The shifted log doesn't remove depth (that's the whole reason for the second PF step in PFlogPF). The table below, from our Supplement, shows the Axioms and whether each method satisfies them.
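
A rough sketch of the depth argument, under assumptions: PF is taken to mean proportional fitting (rescaling each cell so its total equals the mean total across cells), the shifted log is taken to be log1p applied after one PF step, and "depth" is measured as each cell's total in the transformed matrix. The function names are hypothetical, and every cell is assumed to have at least one count.

```python
import numpy as np

def proportional_fitting(X):
    """Scale each cell (row) so that its total equals the mean total across cells."""
    totals = X.sum(axis=1, keepdims=True)  # assumes no all-zero cells
    return X * (totals.mean() / totals)

def shifted_log(counts):
    """PF followed by log(1 + x)."""
    return np.log1p(proportional_fitting(counts))

def pf_log_pf(counts):
    """PF, then log(1 + x), then a second PF on the logged values."""
    return proportional_fitting(shifted_log(counts))

counts = np.array([[1.0, 0.0, 3.0],
                   [10.0, 20.0, 30.0],
                   [100.0, 0.0, 0.0]])  # toy cells x genes with very different depths

print(shifted_log(counts).sum(axis=1))  # per-cell totals still differ after the log
print(pf_log_pf(counts).sum(axis=1))    # second PF makes the totals equal again
```

The log is nonlinear, so equal totals before the log do not stay equal after it; the second PF step restores them by a per-cell rescaling, which leaves the within-cell gene order unchanged.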

sina@sinabooeshaghi

Corresponding thread:
