ABSTRACT
Bayesian optimisation (BO) is a popular sequential decision-making approach for maximising black-box functions in low-data regimes. It can be used to find highly fit protein sequence candidates, since gradient information is not available in vitro. Recent in silico protein design methods have leveraged large pre-trained protein language models (PLMs) as fitness predictors. However, PLMs have a number of shortcomings for sequential design tasks: i) their limited ability to model uncertainty, ii) the lack of closed-form Bayesian updates in light of new experimental data, and iii) the difficulty of fine-tuning on small downstream-task datasets. We take a step back to traditional BO by using Gaussian process (GP) surrogate models with sequence kernels, which can properly model uncertainty and update their beliefs over multi-round design tasks. In this work we empirically demonstrate that BO with GP surrogates is competitive with large pre-trained PLMs on the multi-round sequence design benchmark ProteinGym. Furthermore, we demonstrate improved performance by augmenting the GP with strong zero-shot PLM predictions as a GP prior mean function, and show that by using a learned linear combination of the zero-shot PLM and a constant prior mean, the GP surrogate can regulate the effects of the PLM-guided prior.
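As an illustrative sketch of the augmented prior mean described above (the notation $\alpha$, $\beta$, and $f_{\mathrm{PLM}}$ is ours, not taken verbatim from the paper), the learned linear combination of the zero-shot PLM score and a constant can be written as
\[
  m(x) \;=\; \alpha \, f_{\mathrm{PLM}}(x) \;+\; \beta,
\]
where $f_{\mathrm{PLM}}(x)$ denotes the zero-shot PLM fitness prediction for sequence $x$, and the scalar weight $\alpha$ and offset $\beta$ would be fit jointly with the GP kernel hyperparameters, e.g. by maximising the marginal likelihood. Under this reading, $\alpha \to 0$ recovers a constant prior mean, which is one way the GP surrogate could down-weight an unhelpful PLM-guided prior.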