Tech · 6h ago

AI Models Use External Annotations For Context-Specific Learning In Post-Training

Sentiment

Users are optimistic about AI models using external annotations and biological knowledge in post-training, praising included benchmarks, inductive biases, and biodefense focus as a promising step forward.

Positive: 100.0%, Negative: 0.0%

7 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Anshul Kundaje@anshulkundaje

From what I can tell, this preview model is still not really leveraging the massive amounts of molecular profiling data across many species. 6/

6h

Anshul Kundaje@anshulkundaje

It is great to see an actual long range regulatory benchmark in the blog post (enhancer-to-gene linking) based on CRISPRi FlowFiSH data. Previous long context DNALMs have avoided these benchmarks cuz those models don't learn reg. elements well or their long range effects 17/

6h

Anshul Kundaje@anshulkundaje

Fine mapping also has its problems but I wud certainly treat MPRA nominations from a cell-line as a stronger gold standard. Not a huge deal cuz the model seems to do overall pretty well (although it is a small set of loci tested). 16/

6h

Anshul Kundaje@anshulkundaje

That being said, the performance numbers reported (not verifiable at the moment) seem strong & non trivial. Now some specific comments 10/

6h

Anshul Kundaje@anshulkundaje

When it does, it will be even more powerful, if it is able to seamlessly transfer massive functional data from limited species into poorly profiled species & siphon evolutionary information adaptively into specific species (e.g. humans) that have a lot of functional data. 7/

6h

Anshul Kundaje@anshulkundaje

A huge proportion of functional human DNA does not have strong conservation signatures so it remains to be seen how these models do in such regimes. 9/

6h

Anshul Kundaje@anshulkundaje

The comparisons to Borzoi (as a representative of supervised sequence-to-function, S2F, models) need some nuance (on TraitGym GWAS / QTLs in particular). For disease/trait variant fine mapping, it is critical that S2F models are trained on disease relevant cell type data. 11/

6h

Anshul Kundaje@anshulkundaje

The base Borzoi model lacks many disease relevant cell contexts (even though the data is often available ... later versions do train on single cell pseudobulks from primary cell types). So lack of correct context can result in performance drops. 12/

6h

Anshul Kundaje@anshulkundaje

Also, want to note that from the model description available there are a LOT of inductive biases in the architecture (block conv for motif learning, not a pure Xformer, sparse attention anchoring on functional annotations, etc.). I am happy to see this. 22/

6h

Anshul Kundaje@anshulkundaje

Some broad comments about the benchmarks. All the benchmarks are tilted quite a bit toward conserved elements. ClinVar definitely is. Even TraitGym, which is focused on common variants as designed, has a tilt towards conserved elements. 8/

6h

Anshul Kundaje@anshulkundaje

It does suggest that Omnii does better without requiring such contextual info but a stronger comparator wud be when S2F models have the right cell contexts + are explicitly adapted to predict disease risk (instead of molecular effects). 13/

6h

Anshul Kundaje@anshulkundaje

The AD case study with the MPRA vs fine mapping is a bit weird. The MPRA is treated a bit like a gold standard. MPRAs are often used to "validate" variant nominations. But there is a lot of incoming evidence that MPRAs are not remotely reliable to validate disease variants. 15/

6h

Anshul Kundaje@anshulkundaje

To be fair, proper benchmarking in this zone is still quite nebulous, so this is not a major critique of the benchmarks used here. 14/

6h

Anshul Kundaje@anshulkundaje

That being said, there are clear conventions for this benchmark i.e. distance stratification & auPR instead of auROC. Will be important to see this & a direct head-to-head against the SOTA (AlphaGenome, rE2G etc) 18/

6h
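
The convention mentioned in 18/ (distance stratification, auPR rather than auROC) can be sketched roughly as follows. This is only an illustrative sketch, not the evaluation code of any benchmark named in the thread: the function names, the bin edges and the toy enhancer-gene pairs are all made up for the example. auPR is usually preferred here because true enhancer-gene links are rare relative to candidate pairs, and stratifying by distance keeps easy short-range pairs from dominating the score.

```python
from collections import defaultdict


def average_precision(labels, scores):
    """auPR as the step-wise average-precision sum (no special handling of tied scores)."""
    n_pos = sum(labels)
    if n_pos == 0:
        return float("nan")  # undefined when a bin has no positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos


def stratified_aupr(pairs, bin_edges):
    """pairs: (distance_bp, label, score) tuples; bin_edges: ascending upper bounds in bp.

    Pairs farther away than the last edge are dropped in this sketch.
    """
    bins = defaultdict(lambda: ([], []))
    for distance, label, score in pairs:
        for edge in bin_edges:
            if distance <= edge:
                bins[edge][0].append(label)
                bins[edge][1].append(score)
                break
    return {edge: average_precision(*bins[edge]) for edge in bin_edges if edge in bins}


# Hypothetical toy data: (enhancer-to-gene distance in bp, true link?, model score)
pairs = [
    (5_000, 1, 0.9), (8_000, 0, 0.2), (9_000, 1, 0.7),
    (50_000, 0, 0.8), (60_000, 1, 0.6), (90_000, 0, 0.1),
]
bin_edges = [10_000, 100_000]
result = stratified_aupr(pairs, bin_edges)
print(result)
```

In this toy case the short-range bin is ranked perfectly while the long-range bin is not, which a single pooled metric would blur together.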

Anshul Kundaje@anshulkundaje

I am hopeful there will be more details incoming, a chance for others to battle test the models & hopefully this is the beginning of an actual fusion of evolutionary self-supervision with valuable specific functional context conditioning. 25/

5h

Anshul Kundaje@anshulkundaje

I think it is foolish to not use freely available biological knowledge & annotations, directly or indirectly, when designing & training models just to desperately adhere to the "bitter lesson". 23/

6h

Anshul Kundaje@anshulkundaje

Once strong models are achieved, the lessons learned can help make them sleeker / faster by replacing special purpose modules with optimal general purpose ones if that really helps. 24/

5h

Anshul Kundaje@anshulkundaje

I wud also love to see a future iteration of this model post-trained on massive functional data on the exact Borzoi/AlphaGenome/PromoterAI benchmark suite 19/

6h

Anshul Kundaje@anshulkundaje

But, overall this looks like a promising step forward, giving them the benefit of the doubt without having seen any of the actual details of the model. 21/

6h

Anshul Kundaje@anshulkundaje

These collectively address difficult benchmarks for non-coding regulatory DNA that are quite orthogonal to ClinVar and TraitGym but essential to test whether models can effectively predict diverse types of variant effects 20/

6h