ABSTRACT
Bayesian optimization (BO) is a popular sequential decision-making approach for maximizing black-box functions in low-data regimes. In biology, it has been used to find well-performing protein sequence candidates, since gradient information is not available from in vitro experimentation. Recent in silico design methods have leveraged large pre-trained protein language models (PLMs) to predict protein fitness. However, PLMs have several shortcomings for sequential design tasks: i) their limited ability to model uncertainty, ii) the lack of closed-form Bayesian updates in light of new experimental data, and iii) the difficulty of fine-tuning them on small downstream task datasets. We take a step back to traditional BO by investigating Gaussian process (GP) surrogate models with various sequence kernels, which properly model uncertainty and update their beliefs over multi-round design tasks. We empirically evaluate our method on the sequence design benchmark ProteinGym and demonstrate that BO with GPs is competitive with large SOTA pre-trained PLMs at a fraction of the compute budget.
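As a purely illustrative aid to the approach summarized above, the following is a minimal sketch of BO over protein sequences with a GP surrogate. The exponentiated-Hamming sequence kernel, the UCB acquisition rule, the hyperparameters, and the synthetic fitness oracle are assumptions for illustration only and are not claimed to match the kernels, acquisition functions, or data used in the paper.

```python
# Minimal sketch (assumptions, not the paper's implementation): GP-based BO
# over sequences with a toy exponentiated-Hamming kernel and UCB acquisition.
import numpy as np

def hamming_kernel(A, B, gamma=0.5):
    """Exponentiated negative Hamming distance between equal-length sequences."""
    K = np.zeros((len(A), len(B)))
    for i, a in enumerate(A):
        for j, b in enumerate(B):
            dist = sum(x != y for x, y in zip(a, b))
            K[i, j] = np.exp(-gamma * dist)
    return K

def gp_posterior(X_train, y_train, X_test, noise=1e-3, gamma=0.5):
    """Closed-form GP posterior mean and variance at the test sequences."""
    K = hamming_kernel(X_train, X_train, gamma) + noise * np.eye(len(X_train))
    K_s = hamming_kernel(X_train, X_test, gamma)
    K_ss = hamming_kernel(X_test, X_test, gamma)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v ** 2, axis=0)
    return mean, np.maximum(var, 0.0)

def ucb(mean, var, beta=2.0):
    """Upper confidence bound acquisition score."""
    return mean + beta * np.sqrt(var)

# Toy multi-round design loop over a small candidate pool of length-8 sequences.
rng = np.random.default_rng(0)
alphabet = list("ACDEFGHIKLMNPQRSTVWY")
pool = ["".join(rng.choice(alphabet, size=8)) for _ in range(200)]
oracle = lambda s: sum(c in "AILV" for c in s) + rng.normal(0, 0.1)  # hypothetical fitness

observed = {s: oracle(s) for s in rng.choice(pool, size=5, replace=False)}
for _ in range(10):  # ten design rounds
    X_train, y_train = list(observed), np.array(list(observed.values()))
    candidates = [s for s in pool if s not in observed]
    mean, var = gp_posterior(X_train, y_train, candidates)
    best = candidates[int(np.argmax(ucb(mean, var)))]  # acquire next sequence
    observed[best] = oracle(best)                      # "run the experiment"
print("best sequence found:", max(observed, key=observed.get))
```

Because the GP posterior is available in closed form, each round's new measurement updates the surrogate exactly, which is the property the abstract contrasts with fine-tuned PLMs.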