Bayesian Optimisation for Protein Sequence Design: Gaussian Processes with Zero-Shot Protein Language Model Prior Mean

Carolin Benjamins 1 | Shikha Surana | Oliver Bent | Marius Lindauer 1 | Paul Duckworth

1 Leibniz University Hannover


ABSTRACT

Bayesian optimisation (BO) is a popular sequential decision-making approach for maximising black-box functions in low-data regimes. It can be used to find highly fit protein sequence candidates, since gradient information is not available in vitro. Recent in silico protein design methods have leveraged large pre-trained protein language models (PLMs) as fitness predictors. However, PLMs have a number of shortcomings for sequential design tasks: i) their limited capability to model uncertainty, ii) no closed-form Bayesian updates in light of new experimental data, and iii) the challenge of fine-tuning on small downstream task datasets. We take a step back to traditional BO by using Gaussian process (GP) surrogate models with sequence kernels, which can properly model uncertainty and update their beliefs over multi-round design tasks. In this work we empirically demonstrate that BO with GP surrogates is competitive with large pre-trained PLMs on the multi-round sequence design benchmark ProteinGym. Furthermore, we demonstrate improved performance by augmenting the GP with the strong zero-shot PLM predictions as a GP prior mean function, and show that by using a learned linear combination of the zero-shot PLM and a constant prior mean, the GP surrogate can regulate the influence of the PLM-guided prior.
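The prior-mean construction described in the abstract can be sketched as exact GP regression whose prior mean is a linear combination of a zero-shot PLM score and a constant, m(x) = α·f_PLM(x) + β. The following is a minimal NumPy illustration under stated assumptions, not the paper's implementation: `plm_score` is a placeholder for any zero-shot PLM fitness predictor, the RBF kernel stands in for the paper's sequence kernels, and `alpha`/`beta` are held fixed here (in practice they would be learned, e.g. by marginal-likelihood optimisation alongside the kernel hyperparameters).

```python
import numpy as np


def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel on (featurised/embedded) sequences."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)


class GPWithPLMPriorMean:
    """Exact GP regression with prior mean m(x) = alpha * plm_score(x) + beta.

    alpha -> 0 recovers a plain constant-mean GP; larger alpha leans more
    heavily on the zero-shot PLM prior. Both would be learned in practice.
    """

    def __init__(self, plm_score, alpha=1.0, beta=0.0, noise=1e-4):
        self.plm_score = plm_score  # assumed: callable mapping features -> scores
        self.alpha, self.beta, self.noise = alpha, beta, noise

    def mean(self, X):
        return self.alpha * self.plm_score(X) + self.beta

    def fit(self, X, y):
        # Standard Cholesky-based GP fit on the mean-subtracted targets.
        self.X = X
        K = rbf_kernel(X, X) + self.noise * np.eye(len(X))
        self.L = np.linalg.cholesky(K)
        resid = y - self.mean(X)
        self.weights = np.linalg.solve(self.L.T, np.linalg.solve(self.L, resid))
        return self

    def predict(self, Xs):
        # Posterior mean adds the PLM prior mean back at the query points.
        Ks = rbf_kernel(self.X, Xs)
        mu = self.mean(Xs) + Ks.T @ self.weights
        v = np.linalg.solve(self.L, Ks)
        var = np.diag(rbf_kernel(Xs, Xs)) - (v**2).sum(axis=0)
        return mu, var
```

A BO loop would then score candidate sequences with an acquisition function (e.g. expected improvement) built from `mu` and `var`, so that the PLM prior steers early rounds while observed data increasingly dominates the posterior.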

InstaDeep