Research Papers

Bayesian Optimisation for Protein Sequence Design: Gaussian Processes with Zero-Shot Protein Language Model Prior Mean

Carolin Benjamins | Shikha Surana | Oliver Bent | Marius Lindauer | Paul Duckworth

NeurIPS 2024 workshop Dec 2024

BoostMD – Accelerating MD with MLIP

Lars L. Schaaf | Ilyes Batatia | Christoph Brunken | Thomas D. Barrett | Jules Tilly

NeurIPS 2024 workshop Dec 2024
Free energy surface of unseen alanine dipeptide: comparison of the samples obtained by running ground-truth MD and BoostMD. The free energy of the Ramachandran plot is directly related to the marginalised Boltzmann distribution exp[−F(ϕ, ψ)/k_BT]. The reference model is evaluated every 10 steps. Both simulations are run for 5 ns (5 × 10⁶ steps).
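The free-energy relation in the caption can be sketched numerically: given dihedral samples (ϕ, ψ) from an MD trajectory, F(ϕ, ψ) = −k_BT ln p(ϕ, ψ) follows from a 2-D histogram estimate of the marginal density. The function name and toy samples below are illustrative, not from the paper.

```python
import numpy as np

def free_energy_surface(phi, psi, kBT=1.0, bins=36):
    """Estimate F(phi, psi) = -kBT * ln p(phi, psi) from dihedral samples."""
    hist, phi_edges, psi_edges = np.histogram2d(
        phi, psi, bins=bins, range=[[-np.pi, np.pi], [-np.pi, np.pi]], density=True
    )
    with np.errstate(divide="ignore"):
        F = -kBT * np.log(hist)  # empty bins become +inf (never visited)
    F -= F[np.isfinite(F)].min()  # shift so the global minimum sits at zero
    return F, phi_edges, psi_edges

# Toy usage: samples clustered around a single basin.
rng = np.random.default_rng(0)
phi = rng.normal(-1.0, 0.2, 10_000)
psi = rng.normal(1.0, 0.2, 10_000)
F, _, _ = free_energy_surface(phi, psi)
```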

Learning the Language of Protein Structures

Benoit Gaujac | Jérémie Donà | Liviu Copoiu | Timothy Atkinson | Thomas Pierrot | Thomas D. Barrett

NeurIPS 2024 workshop Dec 2024
Schematic overview of our approach. The protein structure is first encoded as a graph, from which a GNN extracts features. This embedding is then quantised before being fed to the decoder, which estimates the positions of all backbone atoms.
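The quantisation step in this pipeline amounts to a nearest-codebook lookup over the per-residue embeddings. A minimal sketch, assuming hypothetical shapes rather than the paper's exact model:

```python
import numpy as np

def quantize(embeddings, codebook):
    """Map each residue embedding to its nearest codebook vector.

    embeddings: (n_residues, d) features from the structure encoder
    codebook:   (n_codes, d) learned code vectors
    Returns (codes, quantized), where codes[i] indexes the nearest code.
    """
    # Squared Euclidean distance between every embedding and every code.
    d2 = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = d2.argmin(axis=1)
    return codes, codebook[codes]

# Toy usage: embeddings are slightly perturbed copies of known codes.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 8))
emb = codebook[[3, 17, 17, 42]] + 0.01 * rng.normal(size=(4, 8))
codes, quantized = quantize(emb, codebook)
```

The decoder then sees only `quantized` (or the integer `codes`), which is what makes the learned representation a discrete "language" of structure tokens.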

Metalic: Meta-Learning In-Context with Protein Large Language Models

Jacob Beck | Shikha Surana | Manus McAuliffe | Oliver Bent | Thomas D. Barrett | Juan Jose Garau Luis | Paul Duckworth

NeurIPS 2024 workshop Dec 2024
We introduce Metalic, an in-context meta-learning approach for protein fitness prediction in extreme low-data settings. Critically, Metalic leverages a meta-training phase over a distribution of related fitness prediction tasks to learn how to utilize in-context sequences with protein language models (PLMs) and generalize effectively to new fitness prediction tasks. Along with fine-tuning at inference time, Metalic achieves strong performance in protein fitness prediction benchmarks, setting a new SOTA on ProteinGym, with significantly fewer parameters than baselines. Importantly, Metalic demonstrates the ability to make use of in-context learning for zero-shot tasks, further enhancing its applicability to scenarios with minimal labeled data.

Bayesian Optimisation for Protein Sequence Design: Back to Basics with Gaussian Process Surrogates

Carolin Benjamins | Shikha Surana | Oliver Bent | Marius Lindauer | Paul Duckworth

NeurIPS 2024 workshop Dec 2024
Multi-round design averaged over eight single-mutant protein landscapes. Left: top-30% recall (mean and 95% CI); our methods are marked with ∗. Right: wall-clock runtime, interpreted across hardware as compute cost. Our GPs with string (SSK) or fingerprint (Forbes) kernels are competitive with PLM baselines while requiring only a fraction of the runtime and no pre-training.
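A GP surrogate over sequences needs only a kernel that compares strings. As a rough illustration of the idea (using a simple exponentiated-Hamming kernel as a stand-in for the paper's SSK and fingerprint kernels, with toy sequences and fitness values):

```python
import numpy as np

def hamming_kernel(X, Y, gamma=0.5):
    """Exponentiated Hamming similarity between equal-length sequences."""
    K = np.empty((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            dist = sum(a != b for a, b in zip(x, y))
            K[i, j] = np.exp(-gamma * dist)
    return K

def gp_posterior(X_train, y_train, X_test, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP with the kernel above."""
    K = hamming_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = hamming_kernel(X_test, X_train)
    Kss = hamming_kernel(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks @ alpha
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)

train = ["ACDE", "ACDF", "GHIK"]
y = np.array([1.0, 0.9, -0.5])
mean, var = gp_posterior(train, y, ["ACDE", "GHIK"])
```

The posterior mean and variance then feed a standard acquisition function (e.g. expected improvement) to pick the next batch of mutants, with no pre-training required.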

Multi-modal Transfer Learning between Biological Foundation Models

Juan Jose Garau-Luis | Patrick Bordes | Liam Gonzalez | Masa Roller | Bernardo P. de Almeida | Lorenz Hexemer | Christopher Blum | Stefan Laurent | Jan Grzegorzewski | Maren Lang | Thomas Pierrot | Guillaume Richard

NeurIPS 2024 Dec 2024
We demonstrate IsoFormer’s capabilities by applying it to the largely unsolved problem of predicting how multiple RNA transcript isoforms originate from the same gene (i.e. same DNA sequence) and map to different transcription expression levels across various human tissues.

Dispelling the Mirage of Progress in Offline MARL

Claude Formanek | Callum Rhys Tilbury | Louise Beyers | Jonathan Shock | Arnu Pretorius

NeurIPS 2024 Dec 2024
We compare our baseline implementations to the reported performance of various algorithms from the literature across a wide range of datasets. We normalise results from each dataset (i.e. scenario-quality-source combination) by the SOTA performance from the literature for that dataset. Standard deviation bars are given, and when our baseline is significantly better than or equal to the best method under a two-sided t-test, we indicate this with a gold star. We find that on 35 out of the 47 datasets tested (almost 75% of cases), we match or surpass the performance of the current SOTA.
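The evaluation protocol described above reduces to two small computations: SOTA-normalising raw scores per dataset, and a two-sided test between two samples of runs. A sketch, assuming a Welch t-test (the paper says only "two-sided t-test") and made-up numbers:

```python
import math

def normalise(scores, sota):
    """Scale raw returns so the literature SOTA for that dataset equals 1."""
    return [s / sota for s in scores]

def two_sided_t(a, b):
    """Two-sided Welch t statistic and degrees of freedom for two samples."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2 = va / na + vb / nb
    t = (ma - mb) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, dof

ours = normalise([102.0, 98.0, 101.0], sota=100.0)
theirs = normalise([95.0, 96.0, 94.0], sota=100.0)
t, dof = two_sided_t(ours, theirs)
```

Comparing |t| against the two-sided critical value at the chosen significance level then decides whether a gold star is awarded.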

SPO: Sequential Policy Optimisation

Matthew V Macfarlane | Edan Toledo | Donal Byrne | Paul Duckworth | Alexandre Laterre

NeurIPS 2024 Dec 2024
A model-based planning algorithm for both continuous and discrete sequential decision-making problems.

Protein Sequence Modelling with Bayesian Flow Networks

Timothy Atkinson | Thomas D. Barrett | Scott Cameron | Bora Guloglu | Matthew Greenig | Louis Robinson | Alex Graves | Liviu Copoiu | Alexandre Laterre

Sep 2024
Application of a Bayesian Flow Network (BFN) to protein sequence modelling. BFNs update the parameters of a distribution over data, θ, using Bayesian inference given a noised observation y of a data sample. When applied to protein sequence modelling, the distribution over the data is given by separate categorical distributions over the possible tokens (all amino acids plus special tokens) at each sequence index. During training, 'Alice' knows a ground-truth data point x, so θ can be updated directly using noised observations of x. 'Bob' trains a neural network to predict the 'sender' distribution from which Alice samples these observations at each step (i.e. to predict the noised ground truth). During inference, when Alice is not present, Bob replaces noised observations of the ground truth with samples from the 'receiver' distribution predicted by the network.
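The Bayesian update at the heart of this scheme can be sketched for a single sequence position. The sender distribution and update rule below follow the discrete-data BFN of Graves et al. (y ~ N(α(K·e_x − 1), αK·I), with θ_i′ ∝ θ_i·e^{y_i}); the toy alphabet size and accuracy value are illustrative.

```python
import numpy as np

def bayesian_update(theta, y):
    """Update categorical parameters theta given noised observation y,
    via the discrete-data BFN rule theta_i' ∝ theta_i * exp(y_i)."""
    logits = np.log(theta) + y
    logits -= logits.max()  # numerical stability before exponentiating
    theta_new = np.exp(logits)
    return theta_new / theta_new.sum()

K = 4        # toy alphabet size (real models: all amino acids + special tokens)
alpha = 2.0  # accuracy of this sender step
x = 2        # ground-truth token that 'Alice' knows
theta = np.full(K, 1.0 / K)  # uniform prior over tokens

rng = np.random.default_rng(0)
# Alice's sender sample: y ~ N(alpha * (K * e_x - 1), alpha * K * I)
mean = alpha * (K * np.eye(K)[x] - 1.0)
y = rng.normal(mean, np.sqrt(alpha * K))
theta = bayesian_update(theta, y)
```

After the update, θ concentrates on the ground-truth token; at inference Bob substitutes network-predicted receiver samples for Alice's sender samples.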

SMX: Sequential Monte Carlo Planning for Expert Iteration

Edan Toledo | Matthew Macfarlane | Donal John Byrne | Siddarth Singh | Paul Duckworth | Alexandre Laterre

ICML 2024 Jul 2024
Figure 1: Diagram depicting SMX search from left to right. N rollouts are executed in parallel according to π_θ (the sampling policy β). At each step in the environment the particle weights are adjusted, indicated by the particle sizes. We depict two resampling zones where particles are resampled (favouring higher weights) and weights are reset. Finally, an improved policy π′ = Î_β π is constructed from the initial actions of the remaining particles, furthest to the right. This improved policy is then used to update π_θ.
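The resampling zones in the figure correspond to a standard sequential Monte Carlo step: particles are redrawn in proportion to their weights, and weights are reset to uniform. A minimal sketch of that step (function name and toy particles are illustrative, not from the paper):

```python
import random

def resample(particles, weights, rng):
    """Multinomial resampling: draw particles in proportion to their
    weights, then reset all weights to uniform."""
    total = sum(weights)
    probs = [w / total for w in weights]
    new = rng.choices(particles, weights=probs, k=len(particles))
    return new, [1.0 / len(particles)] * len(particles)

# Toy usage: one particle carries most of the weight and tends to survive.
rng = random.Random(0)
particles = ["a", "b", "c", "d"]
weights = [0.05, 0.05, 0.05, 0.85]
particles, weights = resample(particles, weights, rng)
```

The improved policy is then read off from the initial actions of the particles that survive the final resampling.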

Multi-Objective Quality-Diversity for Crystal Structure Prediction

Hannah Janmohamed | Marta Wolinska | Shikha Surana | Aaron Walsh | Thomas Pierrot | Antoine Cully

GECCO 2024 Jul 2024

Overconfident Oracles: Limitations of In Silico Sequence Design Benchmarking

Shikha Surana | Nathan Grinsztajn | Timothy Atkinson | Paul Duckworth | Thomas D. Barrett

ICML 2024 workshop Jul 2024