ABSTRACT
In recent years, large language models (LLMs) trained on enormous corpora of unlabeled biological sequence data have demonstrated state-of-the-art performance on a variety of downstream tasks. These LLMs have been successful in modeling both genomic and proteomic sequences, and their representations have outperformed specialized models across a wide range of tasks. Since the genome contains the information needed to encode all proteins, genomic language models hold the potential to make downstream predictions about proteins as well as DNA. This observation motivates a single model that performs well on both genomic and proteomic downstream tasks. However, because few benchmarks pair proteins with their true coding DNA sequences, it is difficult to compare the two model types directly. In this work, we curate five such datasets and use them to evaluate the performance of multiple state-of-the-art genomic and proteomic models. We find that, despite being pre-trained on largely non-coding sequences, genomic language models are competitive with, and on some tasks even outperform, their protein counterparts.