ABSTRACT
RNA sequencing (RNA-seq) has become a key technology in precision medicine, especially for cancer prognosis. However, the high dimensionality of such data may restrict classic statistical methods, thus raising the need to learn dense representations from them. Transformers models have exhibited capacities in providing representations for long sequences and thus are well suited for transcriptomics data. In this paper, we develop a pre-trained transformer-based language model through self-supervised learning using bulk RNA-seq from both non-cancer and cancer tissues, following BERT’s masking method. By probing learned embeddings from the model or using parameter-efficient fine tuning, we then build downstream models for cancer-type classification and survival-time prediction. Leveraging the TCGA dataset, we demonstrate the performance of our method, BulkRNABert, on both tasks, with significant improvement compared to state-of-the-art methods in the pan cancer setting for classification and survival analysis. We also show the transfer-learning capabilities of the model in the survival analysis setting on unseen cohorts.