ABSTRACT
Understanding the function of proteins is a fundamental goal in molecular biology, with far-reaching implications for drug discovery, disease diagnosis, and biotechnology. While sequence-based representations have received significant attention in machine learning, structure-based models have yet to realize their full potential. In this paper, we introduce BioCLIP, a self-supervised contrastive learning framework for training Protein Structure Models (PSMs) by leveraging Protein Language Models (PLMs) such as ESM2. BioCLIP uses a Contrastive Language–Image Pretraining (CLIP)-inspired loss that exploits both per-residue and per-chain embeddings to learn meaningful structural representations. We evaluate BioCLIP on a range of downstream tasks, including protein-protein interaction prediction, Gene Ontology (GO) term annotation, and Enzyme Commission (EC) number prediction. Our results demonstrate that: (1) BioCLIP-trained Graph Neural Networks (GNNs) significantly outperform models trained from scratch across all tasks; (2) the learned structural embeddings are complementary to sequence embeddings, typically boosting performance when the two are combined; and (3) BioCLIP is competitive with, or outperforms, task-specific methods on all benchmarks, despite being a single pre-trained network that is merely fine-tuned for each task. Our work addresses two key obstacles to structural representation learning: the limited availability of high-quality structural data and the difficulty of designing self-supervised objectives for structural representations, paving the way for more comprehensive models of protein function.
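To make the CLIP-inspired objective concrete, the following is a minimal sketch of a symmetric per-chain contrastive loss in PyTorch. The function name, the use of mean-pooled chain embeddings, and the temperature value are illustrative assumptions, not the exact BioCLIP objective, which additionally leverages per-residue embeddings.

```python
import torch
import torch.nn.functional as F


def clip_style_chain_loss(struct_emb: torch.Tensor,
                          seq_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between paired per-chain embeddings.

    struct_emb: (B, D) chain embeddings from a structure encoder (e.g. a GNN).
    seq_emb:    (B, D) chain embeddings from a PLM such as ESM2
                (here assumed to be mean-pooled over residues).
    Row i of each tensor is assumed to describe the same protein chain.
    """
    # L2-normalise so dot products become cosine similarities.
    struct_emb = F.normalize(struct_emb, dim=-1)
    seq_emb = F.normalize(seq_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the matched pairs.
    logits = struct_emb @ seq_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: structure -> sequence and sequence -> structure.
    loss_s2q = F.cross_entropy(logits, targets)
    loss_q2s = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_s2q + loss_q2s)
```

In this sketch, each chain's structure embedding is pulled toward the PLM embedding of the same chain and pushed away from those of other chains in the batch, which is the standard CLIP-style mechanism for aligning two modalities.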