Research Papers

InstaNovo-P: A de novo peptide sequencing model for phosphoproteomics

Jesper Lauridsen | Pathmanaban Ramasamy | Rachel Catzel | Vahap Canbay | Amandla Mabona | Kevin Eloff | Paul Fullwood | Jennifer Ferguson | Annekatrine Kirketerp-Møller | Ida Sofie Goldschmidt | Tine Claeys | Sam van Puyenbroeck | Santiago Nicolas Lopez Carranza | Erwin M. Schoof | Lennart Martens | Jeroen Van Goey | Chiara Francavilla | Timothy Patrick Jenkins | Konstantinos Kalogeropoulos

Jul 2025

ChatNT: A Multimodal Conversational Agent for DNA, RNA and Protein Tasks

Bernardo P. de Almeida | Guillaume Richard | Hugo Dalla-Torre | Christopher Blum | Lorenz Hexemer | Priyanka Pandey | Stefan Laurent | Chandana Rajesh | Marie Lopez | Alexandre Laterre | Maren Lang | Uğur Şahin | Karim Beguir | Thomas Pierrot

Nature Machine Intelligence Jul 2025
ChatNT is the first multimodal conversational agent with an advanced understanding of biological sequences. It achieves new state-of-the-art results on the Nucleotide Transformer benchmark while solving all tasks at once, in English, and generalizing to unseen questions.

Unified framework for matchgate classical shadows

Valentin Heyraud | Héloise Chomet | Jules Tilly

npj Quantum Information Jul 2025

Leveraging State Space Models in Long Range Genomics

Matvei Popov | Aymen Kallala | Anirudha Ramesh | Narimane Hennouni | Shivesh Khaitan | Rick Gentry | Alain-Sam Cohen

ICLR LMRL 2025 May 2025
Comparison of the extrapolation behavior of state-space models and attention-based models on VEP eQTLs (AUROC). For NTv2, we also report an inference-time extrapolation method: position interpolation. A dotted vertical line marks the fine-tuning sequence length (12 kbp) of all models. Attention-based models collapse when processing sequences longer than those encountered at training time, whereas state-space models generalize to sequences up to 10x longer. Lines that turn dotted indicate values we were unable to compute due to computational cost constraints and that are therefore extrapolated from trends.

Open-Source and FAIR Research Software for Proteomics

Lukas Käll | Yasset Perez-Riverol | Wout Bittremieux | William S. Noble | Lennart Martens | Aivett Bilbao | Michael R. Lazear | Bjorn Grüning | Daniel S. Katz | Michael J. MacCoss | Chengxin Dai | Jimmy K. Eng | Robbin Bouwmeester | Michael R. Shortreed | Enrique Audain | Timo Sachsenberg | Jeroen Van Goey | Georg Wallmann | Bo Wen | William E. Fondrie

May 2025
Open-source software (OSS), aligned with the FAIR Principles (Findable, Accessible, Interoperable, Reusable), offers a solution by promoting transparency, reproducibility, and community-driven development, which fosters collaboration and continuous improvement. In this manuscript, we explore the role of OSS in computational proteomics, its alignment with FAIR principles, and its potential to address challenges related to licensing, distribution, and standardization.

AbBFN2: A flexible antibody foundation model based on Bayesian Flow Networks

Bora Guloglu | Miguel Bragança | Alex Graves | Scott Cameron | Timothy Atkinson | Liviu Copoiu | Alexandre Laterre | Thomas D. Barrett

May 2025

Metalic: Meta-Learning In-Context with Protein Language Models

Jacob Beck | Shikha Surana | Manus McAuliffe | Oliver Bent | Thomas D. Barrett | Juan Jose Garau Luis | Paul Duckworth

ICLR 2025 Apr 2025
Our method, called Metalic (Meta-Learning In-Context), uses in-context learning and, when data is available, fine-tuning to adapt to new tasks.

Simple Guidance Mechanisms for Discrete Diffusion Models

Hugo Dalla-Torre | Sam Boshar | Bernardo P. de Almeida | Thomas Pierrot | Yair Schiff | Subham Sekhar Sahoo | Hao Phung | Guanghan Wang | Alexander Rush | Volodymyr Kuleshov

ICLR 2025 Apr 2025

De novo peptide sequencing with InstaNovo: Accurate, database-free peptide identification for large scale proteomics experiments

Kevin Eloff | Konstantinos Kalogeropoulos | Oliver Morell | Amandla Mabona | Jakob Berg Jespersen | Wesley Williams | Sam P. B. van Beljouw | Marcin Skwark | Andreas Hougaard Laustsen | Stan J. J. Brouns | Erwin M. Schoof | Jeroen Van Goey | Ulrich auf dem Keller | Karim Beguir | Nicolas Lopez Carranza | Timothy P. Jenkins

Nature Machine Intelligence Mar 2025

Bayesian Optimisation for Protein Sequence Design: Gaussian Processes with Zero-Shot Protein Language Model Prior Mean

Carolin Benjamins | Shikha Surana | Oliver Bent | Marius Lindauer | Paul Duckworth

NeurIPS 2024 workshop Dec 2024

BulkRNABert: Cancer prognosis from bulk RNA-seq based language models

Maxence Gélard | Guillaume Richard | Thomas Pierrot | Paul-Henry Cournède

ML4H 2024 Dec 2024
BulkRNABert pipeline. The first phase pre-trains the language model through masked language modeling on binned gene expressions. The second phase fine-tunes a task-specific head, using either a cross-entropy loss for the classification task or a Cox-based loss for the survival task. IA3 rescaling is further added for the classification task.
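The pre-training input described above (binned gene expressions with a fraction of positions masked) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, quantile-based binning, and 15% mask rate are assumptions for the example.

```python
import numpy as np

def bin_and_mask_expressions(expr, n_bins=64, mask_frac=0.15, rng=None):
    """Discretize a gene-expression vector into bin tokens and mask a
    fraction of positions, mimicking a masked-language-modeling setup.
    Quantile bin edges are an assumption; BulkRNABert's exact binning
    scheme may differ."""
    rng = rng if rng is not None else np.random.default_rng(0)
    # n_bins - 1 interior quantile edges -> token ids in [0, n_bins)
    edges = np.quantile(expr, np.linspace(0, 1, n_bins + 1)[1:-1])
    tokens = np.searchsorted(edges, expr)
    mask = rng.random(len(tokens)) < mask_frac   # positions to predict
    inputs = np.where(mask, n_bins, tokens)      # id n_bins acts as [MASK]
    return inputs, tokens, mask
```

The model would then be trained to recover `tokens` at the masked positions of `inputs`, exactly as in text masked language modeling but over expression-bin tokens instead of words.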

BoostMD – Accelerating MD with MLIP

Lars L. Schaaf | Ilyes Batatia | Christoph Brunken | Thomas D. Barrett | Jules Tilly

NeurIPS 2024 workshop Dec 2024
Free energy surface of unseen alanine dipeptide: comparison of the samples obtained by running ground-truth MD and BoostMD. The free energy of the Ramachandran plot is directly related to the marginalized Boltzmann distribution exp[−F(ϕ, ψ)/k_BT]. The reference model is evaluated every 10 steps. Both simulations are run for 5 ns (5 × 10⁶ steps).
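The relation in the caption, F(ϕ, ψ) = −k_B T ln p(ϕ, ψ), is how such a Ramachandran free-energy surface is typically estimated from sampled dihedral angles. A minimal sketch, assuming angles in degrees, a histogram density estimate of p, and an illustrative k_B T value (the function name and bin count are choices for this example, not from the paper):

```python
import numpy as np

KB_T = 2.494  # k_B * T in kJ/mol at 300 K (illustrative value)

def free_energy_surface(phi, psi, bins=36):
    """Estimate F(phi, psi) = -k_B T ln p(phi, psi) from sampled
    dihedral angles (degrees). Empty bins get infinite free energy;
    the surface is shifted so its minimum is zero."""
    p, _, _ = np.histogram2d(phi, psi, bins=bins,
                             range=[[-180, 180], [-180, 180]],
                             density=True)
    with np.errstate(divide="ignore"):
        F = -KB_T * np.log(p)
    return F - F[np.isfinite(F)].min()
```

Applied to the (ϕ, ψ) trajectories of the ground-truth MD and BoostMD runs, this yields the two surfaces being compared in the figure.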