Research Papers

Bayesian Optimisation for Protein Sequence Design: Gaussian Processes with Zero-Shot Protein Language Model Prior Mean

Carolin Benjamins | Shikha Surana | Oliver Bent | Marius Lindauer | Paul Duckworth

NeurIPS 2024 workshop Dec 2024
Bayes Opt for Protein Design

BulkRNABert: Cancer prognosis from bulk RNA-seq based language models

Maxence Gélard | Guillaume Richard | Thomas Pierrot | Paul-Henry Cournède

ML4H 2024 Dec 2024
BulkRNABert pipeline. The first phase pre-trains the language model through masked language modeling on binned gene expressions. The second phase fine-tunes a task-specific head using either cross-entropy for the classification task or a Cox-based loss for the survival task. IA3 rescaling is further added for the classification task.
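To make the "Cox-based loss" for the survival task concrete, here is a minimal sketch of a standard Cox negative partial log-likelihood. It is a generic textbook formulation, not necessarily the exact loss used by the authors; the function name and the `scores`, `times`, `events` arguments are illustrative assumptions.

```python
import math

def cox_neg_partial_log_likelihood(scores, times, events):
    """Generic Cox loss (assumed form, not the authors' exact implementation).

    scores: predicted risk score per patient (higher = higher risk)
    times:  observed survival or censoring time per patient
    events: 1 if the event was observed, 0 if the patient was censored
    """
    total, n_events = 0.0, 0
    for s_i, t_i, e_i in zip(scores, times, events):
        if not e_i:
            continue  # censored patients only appear in risk sets
        # Risk set: everyone still under observation at time t_i.
        risk_set = [s for s, t in zip(scores, times) if t >= t_i]
        m = max(risk_set)  # log-sum-exp with max subtraction for stability
        log_sum = m + math.log(sum(math.exp(s - m) for s in risk_set))
        total -= s_i - log_sum
        n_events += 1
    return total / max(n_events, 1)
```

A model that assigns higher risk to patients with shorter survival times gets a lower loss than one that ranks them the other way round.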

BoostMD – Accelerating MD with MLIP

Lars L. Schaaf | Ilyes Batatia | Christoph Brunken | Thomas D. Barrett | Jules Tilly

NeurIPS 2024 workshop Dec 2024
Free energy surface of unseen alanine dipeptide. Comparison of the samples obtained by running ground-truth MD and BoostMD. The free energy of the Ramachandran plot is directly related to the marginalized Boltzmann distribution exp[−F(ϕ, ψ)/k_B T]. The reference model is evaluated every 10 steps. Both simulations are run for 5 ns (5 × 10^6 steps).
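Since the marginal density satisfies p(ϕ, ψ) ∝ exp[−F(ϕ, ψ)/k_B T], a free energy surface can be estimated from sampled dihedral angles as F = −k_B T ln p, up to an additive constant. The sketch below assumes angles in degrees in [−180, 180], a hypothetical bin count and temperature, and energies in kcal/mol; none of these settings come from the paper.

```python
import math
from collections import Counter

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def free_energy_surface(phi_psi_samples, n_bins=36, temperature=300.0):
    """Histogram (phi, psi) samples and convert bin probabilities to free energies.

    Returns a dict mapping (phi_bin, psi_bin) to F in kcal/mol, shifted so the
    most populated bin has F = 0. Unvisited bins are simply absent.
    """
    width = 360.0 / n_bins
    counts = Counter()
    for phi, psi in phi_psi_samples:
        i = int((phi + 180.0) // width) % n_bins  # modulo wraps +180 onto bin 0
        j = int((psi + 180.0) // width) % n_bins
        counts[(i, j)] += 1
    total = sum(counts.values())
    kT = K_B * temperature
    free = {b: -kT * math.log(c / total) for b, c in counts.items()}
    f_min = min(free.values())
    return {b: f - f_min for b, f in free.items()}
```

Comparing two such surfaces, one from reference MD and one from an accelerated run, is one way to check that both sample the same distribution.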

Learning the Language of Protein Structures

Benoit Gaujac | Jérémie Donà | Liviu Copoiu | Timothy Atkinson | Thomas Pierrot | Thomas D. Barrett

NeurIPS 2024 workshop Dec 2024
Schematic overview of our approach. The protein structure is first encoded as a graph, from which a GNN extracts features. This embedding is then quantized before being fed to the decoder to estimate the positions of all backbone atoms.
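The quantization step can be illustrated with a nearest-neighbour codebook lookup, which maps each continuous embedding to a discrete token. This is only one common way to quantize and is an assumption made for illustration; the caption does not say which quantizer the authors use, and the `quantize` function and toy codebook are hypothetical.

```python
def quantize(embeddings, codebook):
    """Map each embedding vector to the index of its closest codebook entry.

    embeddings: list of per-residue feature vectors (e.g. GNN outputs)
    codebook:   list of code vectors with the same dimensionality
    """
    tokens = []
    for vec in embeddings:
        # Squared Euclidean distance to every code vector.
        dists = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

# Toy usage: three 2-D embeddings and a three-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
tokens = quantize([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7]], codebook)
```

A decoder would then consume the code vectors selected by `tokens` rather than the original continuous embeddings.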

Metalic: Meta-Learning In-Context with Protein Large Language Models

Jacob Beck | Shikha Surana | Manus McAuliffe | Oliver Bent | Thomas D. Barrett | Juan Jose Garau Luis | Paul Duckworth

NeurIPS 2024 workshop Dec 2024
We introduce Metalic, an in-context meta-learning approach for protein fitness prediction in extreme low-data settings. Critically, Metalic leverages a meta-training phase over a distribution of related fitness prediction tasks to learn how to utilize in-context sequences with protein language models (PLMs) and generalize effectively to new fitness prediction tasks. Along with fine-tuning at inference time, Metalic achieves strong performance in protein fitness prediction benchmarks, setting a new SOTA on ProteinGym, with significantly fewer parameters than baselines. Importantly, Metalic demonstrates the ability to make use of in-context learning for zero-shot tasks, further enhancing its applicability to scenarios with minimal labeled data.

Bayesian Optimisation for Protein Sequence Design: Back to Basics with Gaussian Process Surrogates

Carolin Benjamins | Shikha Surana | Oliver Bent | Marius Lindauer | Paul Duckworth

NeurIPS 2024 workshop Dec 2024
Multi-round design averaged over eight single-mutant protein landscapes. Left: Top-30% recall (mean and 95% CI). Our methods are highlighted with ∗. Right: Wall-clock runtime interpreted across hardware as compute costs. Our GPs with string (SSK) or fingerprint (Forbes) kernels are competitive with PLM baselines whilst requiring only a fraction of the runtime and no pre-training.
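As a rough illustration of the recall metric, the sketch below assumes that top-30% recall means the fraction of the landscape's best 30% of sequences that the optimiser has selected so far. That definition, the function name and the arguments are assumptions for illustration; the paper's exact definition may differ.

```python
def top_fraction_recall(all_fitness, selected_indices, fraction=0.3):
    """Fraction of the top-`fraction` sequences (by fitness) that were selected.

    all_fitness:      fitness value of every sequence in the landscape
    selected_indices: indices of sequences chosen across all design rounds
    """
    n_top = max(1, round(len(all_fitness) * fraction))
    ranked = sorted(range(len(all_fitness)), key=lambda i: all_fitness[i], reverse=True)
    top = set(ranked[:n_top])
    return len(top & set(selected_indices)) / n_top
```

For a landscape of ten sequences, the top 30% contains three of them, so selecting two of those three gives a recall of 2/3.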

Multi-modal Transfer Learning between Biological Foundation Models

Juan Jose Garau-Luis | Patrick Bordes | Liam Gonzalez | Masa Roller | Bernardo P. de Almeida | Lorenz Hexemer | Christopher Blum | Stefan Laurent | Jan Grzegorzewski | Maren Lang | Thomas Pierrot | Guillaume Richard

NeurIPS 2024 Dec 2024
We demonstrate IsoFormer’s capabilities by applying it to the largely unsolved problem of predicting how multiple RNA transcript isoforms originate from the same gene (i.e. same DNA sequence) and map to different transcription expression levels across various human tissues.

Dispelling the Mirage of Progress in Offline MARL

Claude Formanek | Callum Rhys Tilbury | Louise Beyers | Jonathan Shock | Arnu Pretorius

NeurIPS 2024 Dec 2024
We compare our baseline implementations to the reported performance of various algorithms from the literature across a wide range of datasets. We normalise results from each dataset (i.e. scenario-quality-source combination) by the SOTA performance from the literature for that dataset. Standard deviation bars are given and when our baseline is significantly better than or equal to the best method, using a two-sided t-test, we indicate so using a gold star. We find that on 35 out of the 47 datasets tested (almost 75% of cases), we match or surpass the performance of the current SOTA.
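The per-dataset normalisation described above can be sketched as follows. The sketch assumes a positive SOTA return, so that dividing by it preserves ordering; the function name and the example returns are hypothetical, and the significance test itself is omitted.

```python
from statistics import mean, stdev

def normalise_by_sota(baseline_returns, sota_return):
    """Scale per-seed baseline returns so the literature SOTA equals 1.0.

    Returns the mean and sample standard deviation of the scaled returns.
    Assumes sota_return > 0; negative returns would need a different scheme.
    """
    scaled = [r / sota_return for r in baseline_returns]
    return mean(scaled), stdev(scaled)

# Hypothetical returns over three seeds on one dataset, with a SOTA return of 10.
norm_mean, norm_std = normalise_by_sota([9.0, 10.0, 11.0], 10.0)
```

A normalised mean of at least 1.0 indicates that the baseline matches or exceeds the reported SOTA on that dataset.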

SPO: Sequential Policy Optimisation

Matthew V Macfarlane | Edan Toledo | Donal Byrne | Paul Duckworth | Alexandre Laterre

NeurIPS 2024 Dec 2024
A model-based planning algorithm for both continuous and discrete sequential decision-making problems.

Nucleotide Transformer: building and evaluating robust foundation models for human genomics

Hugo Dalla-Torre | Liam Gonzalez | Javier Mendoza-Revilla | Nicolas Lopez Carranza | Adam Henryk Grzywaczewski | Francesco Oteri | Christian Dallago | Evan Trop | Bernardo P. de Almeida | Hassan Sirelkhatim | Guillaume Richard | Marcin Skwark | Karim Beguir | Marie Lopez | Thomas Pierrot

Nature Methods 2024 Nov 2024
Left: Graphical representation of genomic features considered for downstream tasks to evaluate NT performance. Right: Comparison of NTs to baselines. We report Normalized mean of MCC performance across downstream tasks (divided by category) for all methods after fine-tuning.

Protein Sequence Modelling with Bayesian Flow Networks

Timothy Atkinson | Thomas D. Barrett | Scott Cameron | Bora Guloglu | Matthew Greenig | Louis Robinson | Alex Graves | Liviu Copoiu | Alexandre Laterre

Sep 2024
Application of a Bayesian Flow Network (BFN) to protein sequence modelling. BFNs update the parameters of the data distribution, 𝜃, using Bayesian inference given a noised observation, y, of a data sample. When applied to protein sequence modelling, the distribution over the data is given by separate categorical distributions over the possible tokens (all amino acids and special tokens such as , , and ) at each sequence index. During training, ‘Alice’ knows a ground truth data point x, and so 𝜃 can be directly updated using noised observations of x. ‘Bob’ trains a neural network to predict the ‘sender’ distribution from which Alice is sampling these observations at each step (i.e. to predict the noised ground truth). During inference, when Alice is not present, Bob replaces noised observations of the ground truth with samples from the ‘receiver’ distribution predicted by the network.
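The update of 𝜃 from a noised observation can be sketched for a single sequence position. The sketch assumes the discrete-data formulation from the general BFN literature, where the sender sample is Gaussian with mean α(K e_x − 1) and variance αK, and the update is 𝜃′ ∝ exp(y) · 𝜃. The accuracy value α, the number of steps and the function names are illustrative assumptions rather than settings from this work.

```python
import math
import random

def sender_sample(x, num_classes, alpha, rng):
    """Alice's noised observation y of the true token index x (assumed form)."""
    std = math.sqrt(alpha * num_classes)
    return [
        rng.gauss(alpha * (num_classes * (1.0 if k == x else 0.0) - 1.0), std)
        for k in range(num_classes)
    ]

def bayesian_update(theta, y):
    """Update categorical parameters: theta'_k is proportional to exp(y_k) * theta_k."""
    m = max(y)  # subtract the max before exponentiating for stability
    weights = [t * math.exp(v - m) for t, v in zip(theta, y)]
    z = sum(weights)
    return [w / z for w in weights]

# Start from a uniform prior over 4 tokens and apply 20 noisy updates for x = 2.
rng = random.Random(0)
theta = [0.25] * 4
for _ in range(20):
    theta = bayesian_update(theta, sender_sample(2, 4, alpha=5.0, rng=rng))
```

After enough updates, 𝜃 concentrates on the true token. At inference time the observation would instead come from the receiver distribution predicted by the network.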

SMX: Sequential Monte Carlo Planning for Expert Iteration

Edan Toledo | Matthew Macfarlane | Donal John Byrne | Siddarth Singh | Paul Duckworth | Alexandre Laterre

ICML 2024 Jul 2024
Figure 1: Diagram depicting a representation of SMX search from left to right. N rollouts are executed in parallel according to πθ (the sampling policy β). At each step in the environment the particle weights are adjusted, indicated by the particle sizes. We depict two resampling zones where particles are resampled (favouring higher weights) and weights are reset. Finally, an improved policy π′ = Î_β π is constructed from the initial actions of the remaining particles, furthest to the right. This improved policy is then used to update πθ.
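The resampling step and the construction of the improved policy can be sketched in a few lines. This is a generic multinomial resampling scheme under assumed conventions (log-weights, particles stored as dicts with a hypothetical `first_action` key), not the authors' implementation.

```python
import math
import random
from collections import Counter

def resample(particles, log_weights, rng):
    """Resample particles in proportion to their weights, then reset the weights."""
    m = max(log_weights)  # subtract the max for numerical stability
    probs = [math.exp(w - m) for w in log_weights]
    survivors = rng.choices(particles, weights=probs, k=len(particles))
    return survivors, [0.0] * len(particles)  # log-weight 0 = uniform weights

def first_action_policy(particles):
    """Empirical distribution over the first actions of the surviving particles."""
    counts = Counter(p["first_action"] for p in particles)
    return {a: c / len(particles) for a, c in counts.items()}

# Toy usage: one particle dominates, so every survivor shares its first action.
particles = [{"first_action": "left"}, {"first_action": "right"}, {"first_action": "up"}]
survivors, new_log_weights = resample(particles, [0.0, -1000.0, -1000.0], random.Random(0))
improved_policy = first_action_policy(survivors)
```

The resulting distribution would then serve as a training target for the sampling policy.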