/Tech · 7h ago

scGPT developer Bo Wang highlights analysis warning that improper single-cell genomics normalization distorts downstream biological AI models

Without variance stabilization, high-mean genes disproportionately dominate PCA.

Posts from X

Most Activity

sina@sinabooeshaghi

4. These properties are well established and accepted by the field for evaluating a "good" normalization method. Both the sctransform paper and the Ahlmann-Eltze & Huber benchmark evaluate methods against variance stabilization and depth normalization.

6h

sina@sinabooeshaghi

12. The point is that AI isn't thinking deeply. It's not reading the literature, developing reasonable evaluation criteria, or benchmarking normalization methods against them. It repeats the field's default and confidently justifies it. In this case, we know the answer. But what happens when we don't?

6h

sina@sinabooeshaghi

9. Strangely enough, when you ask Claude Code or ChatGPT what normalization method to use, they tell you sctransform (ChatGPT) or the shifted log (Claude Code), the method favored by the AE&H benchmark. Ask why, and they say these methods satisfy the criteria listed above. But they don't!

6h

sina@sinabooeshaghi

⤴️ Top of the thread

6h

sina@sinabooeshaghi

8. The clearest benchmark result is downsampling (introduced by AE&H, Nature Methods 2023). Take a deeply sequenced dataset, downsample it, and count how many of each cell's 50 nearest neighbors survive. PFlogPF keeps 36.8. The other methods keep about 5.8.

6h
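The downsampling comparison described in post 8 can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the binomial thinning, the brute-force Euclidean neighbor search, the use of plain log1p as the transform, and the function names (`downsample`, `knn_indices`, `mean_knn_overlap`) are all assumptions made here for clarity, and the toy data uses k=10 rather than 50.

```python
import numpy as np

def downsample(counts, frac, rng):
    # Assumed thinning scheme: keep each count independently with probability frac.
    return rng.binomial(counts, frac)

def knn_indices(x, k):
    # Brute-force k nearest neighbors by squared Euclidean distance.
    x = np.asarray(x, dtype=float)
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.fill_diagonal(d2, np.inf)  # a cell is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def mean_knn_overlap(full, down, k):
    # Average number of each cell's k neighbors that survive downsampling.
    a, b = knn_indices(full, k), knn_indices(down, k)
    return float(np.mean([len(set(r) & set(s)) for r, s in zip(a, b)]))

rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(100, 30))   # toy cells x genes matrix
thinned = downsample(counts, 0.2, rng)

# Any normalization can be plugged in here; log1p is only a placeholder.
overlap = mean_knn_overlap(np.log1p(counts), np.log1p(thinned), k=10)
print(overlap)
```

A higher overlap means the neighbor graph is more robust to sequencing depth, which is the property the thread reports as 36.8 of 50 for PFlogPF.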

sina@sinabooeshaghi

5. The "monotonicity" requirement is underdiscussed in the literature but important. A method that is not monotone scrambles the ordering of genes within a cell (and cell type), making it challenging to compare any two genes.

6h
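One way to read the monotonicity requirement in post 5 is as a check that within-cell gene order survives the transform. The sketch below is an illustration under that reading; the function name `preserves_within_cell_order` is hypothetical, and the per-gene z-score is only a toy example of a gene-wise transform, not a stand-in for sctransform.

```python
import numpy as np

def preserves_within_cell_order(raw, transformed):
    # For every cell (row): if gene a has a strictly larger raw count than
    # gene b, it must also have a strictly larger transformed value.
    for r, t in zip(np.asarray(raw, float), np.asarray(transformed, float)):
        order = np.lexsort((t, r))  # sort by raw count, ties broken by t
        dr, dt = np.diff(r[order]), np.diff(t[order])
        if np.any(dt[dr > 0] <= 0):
            return False
    return True

raw = np.array([[5.0, 10.0],
                [1.0, 100.0]])

# A per-cell monotone transform keeps the order.
log_ok = preserves_within_cell_order(raw, np.log1p(raw))

# A per-gene z-score (toy example) can invert the order within a cell.
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
z_ok = preserves_within_cell_order(raw, z)
print(log_ok, z_ok)
```

In the first cell the raw counts are 5 < 10, but the z-scores are 1 and -1, so the ordering flips.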

sina@sinabooeshaghi

13. In conclusion, if you want a scRNAseq normalization method that best satisfies
- depth normalization
- variance stabilization
- monotonicity

run PFlogPF (package coming soon).

The code is available here: http://github.com/pachterlab/BHGP_2022

The manuscript is available here: https://www.biorxiv.org/content/10.1101/2022.05.06.490859v3

6h

sina@sinabooeshaghi

6. These three (practical) metrics can be associated with mathematical axioms that a normalization method must satisfy. In our supplementary note, we prove that these axioms produce a unique normalization method for single-cell RNAseq data (PFlogPF), also known as the shifted CLR.

6h

sina@sinabooeshaghi

11. And the method that does satisfy all three isn't new. It's the centered log-ratio, from 1982! This transform has been available for 40+ years, passed over in hundreds of thousands of scRNAseq studies for methods that perform poorly with respect to these desiderata.

6h
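For reference, the standard centered log-ratio of a strictly positive vector x with n components (here, one cell's counts across n genes) is written below. The zero-handling remark is an assumption on my part about why the thread speaks of a "shifted" CLR; the thread itself does not give the formula.

```latex
\operatorname{clr}(x)_i
  = \log\frac{x_i}{g(x)}
  = \log x_i - \frac{1}{n}\sum_{j=1}^{n}\log x_j,
\qquad
g(x) = \Bigl(\prod_{j=1}^{n} x_j\Bigr)^{1/n}
% Undefined when any x_j = 0, which is common in scRNAseq counts,
% so some shift or pseudocount is needed before taking logs (assumption).
```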

sina@sinabooeshaghi

7. This theorem, plus a large-scale benchmark on 526 datasets, convinced us PFlogPF best satisfies these desiderata in practice compared to other methods.

6h

sina@sinabooeshaghi

10. Each method fails one of the three. sctransform is not monotone (it scrambles within-cell gene order). The shifted log doesn't remove depth (that's the whole reason for the second PF step in PFlogPF). The table below, from our Supplement, shows the Axioms and whether each method satisfies them.

6h
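The "second PF step" mentioned in post 10 can be illustrated with a small sketch. It assumes PF means proportional fitting (scaling each cell so its total equals the mean total across cells) and that PFlogPF is PF, then log1p, then PF again. The function names are hypothetical, and this is not the authors' package or the code from their repository.

```python
import numpy as np

def proportional_fitting(x):
    # Scale each cell (row) so its total equals the mean total across cells.
    totals = x.sum(axis=1, keepdims=True)
    safe = np.where(totals == 0, 1.0, totals)  # leave empty cells untouched
    return x * (totals.mean() / safe)

def shifted_log(counts):
    # Depth-scale, then log1p (assumed reading of "the shifted log").
    return np.log1p(proportional_fitting(np.asarray(counts, dtype=float)))

def pf_log_pf(counts):
    # PF, log1p, then PF again (assumed reading of "PFlogPF").
    return proportional_fitting(shifted_log(counts))

rng = np.random.default_rng(0)
profile = rng.dirichlet(np.ones(50))          # shared expression profile
depth = rng.integers(500, 5000, size=200)     # per-cell sequencing depth
counts = rng.poisson(depth[:, None] * profile[None, :])

totals_shifted = shifted_log(counts).sum(axis=1)
totals_pflogpf = pf_log_pf(counts).sum(axis=1)
print(totals_shifted.std(), totals_pflogpf.std())
```

On this toy data the cell totals after the shifted log still vary from cell to cell, while after the second PF step they are equal. Both PF and log1p are increasing per-cell maps, so within-cell gene order is kept.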

sina@sinabooeshaghi

Corresponding thread:

6h

Paula Chandler@SilverHollowsz

@BoWang87 @Kevin_McKernan

6h