
TARIO-2 is a foundation model trained on aligned tiles of human H&E pathology images and spatial RNA transcripts. Each modality is encoded into a sequence of tokens by a modality-specific encoder, and the two sequences are then concatenated into a single sequence.
During pretraining, a transformer-based decoder predicts later tokens in the concatenated sequence from earlier ones. At inference time, the model operates on clinical H&E images alone.
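
The data flow described above can be illustrated with a minimal sketch in plain Python. It is not the TARIO-2 implementation. The token ids, the helper names (`concat_modalities`, `next_token_pairs`, `causal_mask`), the use of discrete tokens, and the choice to place image tokens before RNA tokens are all assumptions made for illustration. The shift-by-one targets reflect one common way to set up "predict later tokens from earlier tokens" and may differ from the actual training objective.

```python
# Hypothetical token ids standing in for the outputs of the two
# modality-specific encoders (the real encoders are not described here).
image_tokens = [11, 12, 13, 14]  # assumed: tokens from an H&E image tile
rna_tokens = [21, 22, 23]        # assumed: tokens from the aligned RNA tile


def concat_modalities(image_tokens, rna_tokens):
    """Join the per-modality sequences into one sequence.

    Assumption: image tokens come first, so RNA tokens are predicted
    with the image tokens as context.
    """
    return list(image_tokens) + list(rna_tokens)


def next_token_pairs(sequence):
    """Shift by one: the input at position i is paired with the token at i + 1."""
    return sequence[:-1], sequence[1:]


def causal_mask(length):
    """Row i is True at columns j <= i, so a position never sees later tokens."""
    return [[j <= i for j in range(length)] for i in range(length)]


sequence = concat_modalities(image_tokens, rna_tokens)
inputs, targets = next_token_pairs(sequence)
mask = causal_mask(len(inputs))

# At inference only the image tokens would be supplied as the prefix.
inference_prefix = list(image_tokens)
```

Under this assumed ordering, a decoder given `inference_prefix` would have only image-derived context available, which matches the statement that the model operates on H&E images alone at inference time.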