ABSTRACT
Rapid identification of T cell receptors (TCRs) that specifically bind patient-unique neoepitopes is a critical challenge for personalized TCR-based therapies in oncology. Owing to the enormous diversity of both the TCR and neoepitope repertoires, a machine learning predictor of TCR-pMHC specificity for personalized therapy must generalize to TCRs and epitopes not seen in the training data. Here, we estimate the size of the training data required to achieve this. We first confirm that published models fail to generalize to epitopes that differ from the training-set epitopes by more than a single residue. We then impute the point-mutation ligandome across the 34 most prevalent human MHC alleles and represent it as a graph whose edges are defined by this dissimilarity cutoff. By finding a dominating set of this graph, we estimate that between one and 100 million epitopes are required to train a generalizable sequence-based TCR specificity prediction model, which is 1000 times the size of current public data.
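The graph-based estimate can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: dissimilarity is taken to be the Hamming distance between equal-length peptides, the cutoff is a single residue, and a simple greedy heuristic stands in for whatever dominating-set procedure was actually used. The peptide list and the names `hamming`, `build_graph`, and `greedy_dominating_set` are hypothetical and serve only as a toy example.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length peptides."""
    return sum(x != y for x, y in zip(a, b))


def build_graph(peptides, cutoff=1):
    """Connect equal-length peptides whose Hamming distance is <= cutoff.

    Assumption: Hamming distance stands in for the dissimilarity measure.
    """
    graph = {p: set() for p in peptides}
    for a, b in combinations(graph, 2):
        if len(a) == len(b) and hamming(a, b) <= cutoff:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_dominating_set(graph):
    """Greedy approximation of a small dominating set.

    Repeatedly pick the node that covers the most still-uncovered nodes
    (itself plus its neighbours). Ties are broken alphabetically.
    """
    uncovered = set(graph)
    chosen = []
    while uncovered:
        best = max(sorted(graph), key=lambda n: len(({n} | graph[n]) & uncovered))
        chosen.append(best)
        uncovered -= {best} | graph[best]
    return chosen


# Toy peptides (hypothetical example data, not taken from the paper).
toy = ["SIINFEKL", "SIINFEKM", "SIINFEKV", "AIINFEKL", "GILGFVFTL"]
graph = build_graph(toy, cutoff=1)
cover = greedy_dominating_set(graph)
print(cover)  # ['SIINFEKL', 'GILGFVFTL']
```

In this toy case two peptides dominate all five: every other peptide is within one residue of a chosen one. Under the assumptions above, the size of such a set is a rough proxy for how many training epitopes are needed so that every epitope in the ligandome lies within the cutoff of at least one training example. The greedy heuristic only gives an upper bound on the minimum dominating set size.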