Research Papers

Generalizable direct protein sequencing with InstaNexus

Marco Reverenna | Maike Wennekers Nielsen | Darian Stephan Wolff | Jemma Daniel | Elpida Lytra | Suthimon Thumtecho | Pasquale D. Colaianni | Anne Ljungars | Andreas H. Laustsen | Erwin M. Schoof | Jeroen Van Goey | Timothy P. Jenkins | Marie V. Lukassen | Alberto Santos | Konstantinos Kalogeropoulos

Molecular & Cellular Proteomics 2026 Mar 2026
The paper introduces InstaNexus, an optimized, end-to-end workflow for direct protein sequencing. It combines multi-protease sample preparation, AI-driven de novo peptide sequencing using InstaNovo, and a novel assembly pipeline to reconstruct contiguous protein sequences. InstaNexus demonstrates high accuracy and coverage across diverse proteins such as nanobodies, antibodies, and de novo designed binders, offering promising applications in therapeutic discovery and immune profiling without relying on reference genomes.

Protein sequence modelling with Bayesian flow networks

Timothy Atkinson | Thomas D. Barrett | Scott Cameron | Bora Guloglu | Matthew Greenig | Charlie B. Tan | Louis Robinson | Alex Graves | Liviu Copoiu | Alexandre Laterre

Nature Communications (2025) Feb 2026
The figure shows the application of a Bayesian Flow Network (BFN) to protein-sequence modelling.

Annotating the genome at single-nucleotide resolution with DNA foundation models

Bernardo P. de Almeida | Hugo Dalla-Torre | Guillaume Richard | Christopher Blum | Lorenz Hexemer | Maxence Gélard | Javier Mendoza-Revilla | Ziqi Tang | Frederikke I. Marin | David M. Emms | Priyanka Pandey | Stefan Laurent | Marie Lopez | Alexandre Laterre | Maren Lang | Uğur Şahin | Karim Beguir | Thomas Pierrot

Nature Methods (2025) Jan 2026
The SegmentNT neural network architecture consists of a pre-trained DNA encoder (here, the Nucleotide Transformer (NT)) and a segmentation head (here, a U-Net).

GeoGraph: Geometric and Graph-based Ensemble Descriptors for Intrinsically Disordered Proteins

Eoin Quinn | Marco Carobene | Jean Quentin | Sebastien Boyer | Miguel Arbesú | Oliver Bent

NeurIPS 2025 Workshops Dec 2025
GeoGraph is a simulation-informed surrogate trained to predict ensemble-averaged statistics of residue–residue contact-map topology directly from sequence.

A foundational model for joint sequence-function multi-species modeling at scale for long-range genomic prediction

Sam Boshar | Benjamin Evans | Ziqi Tang | Armand Picard | Yanis Adel | Franziska K. Lorbeer | Chandana Rajesh | Tristan Karch | Shawn Sidbon | David Emms | Javier Mendoza-Revilla | Fatimah Al-Ani | Evan Seitz | Yair Schiff | Yohan Bornachot | Ariana Hernandez | Marie Lopez | Alexandre Laterre | Karim Beguir | Peter Koo | Volodymyr Kuleshov | Alexander Stark | Bernardo P. de Almeida | Thomas Pierrot

Dec 2025
NTv3 is InstaDeep’s new multi-species genomics foundation model, designed for 1 Mb, single-nucleotide-resolution prediction, and for bridging representation learning, sequence-to-function modelling, and generative regulatory design within a single framework.

Bayes-PD: Exploring a Sequence to Binding Bayesian Neural Network model trained on Phage Display data

Ilann Amiaud-Plachy | Michael Blank | Oliver Bent | Sebastien Boyer

NeurIPS 2025 workshop Dec 2025
Figure 2 shows the phage display Poisson model.

Assessing data size requirements for training generalizable sequence-based TCR specificity models via pan-allelic MHC-I point-mutation ligandome evaluation

Antoine Delaunay | Miles McGibbon | Bachir Djermani | Nikolai Gorbushin | Sergio Chaves García-Mascaraque | Isaac Rayment | Ilya Kizhvatov | Cécile Petit | Maren Lang | Karim Beguir | Ugur Sahin | Liviu Copoiu | Nicolas Lopez Carranza | Andrey Tovchigrechko

Scientific Reports Dec 2025

MEMENTO: Memory-Enhanced Neural Solvers for Routing Problems

Felix Chalumeau | Refiloe Shabe | Noah De Nicola | Arnu Pretorius | Thomas D. Barrett | Nathan Grinsztajn

NeurIPS 2025 (Spotlight) Nov 2025

Breaking the Performance Ceiling in Reinforcement Learning requires Inference Strategies

Felix Chalumeau | Daniel Rajaonarivonivelomanantsoa | Ruan de Kock | Claude Formanek | Sasha Abramowitz | Oumayma Mahjoub | Wiem Khlifi | Simon Du Toit | Louay Ben Nessir | Refiloe Shabe | Arnol Fokam | Siddarth Singh | Ulrich Mbou Sob | Arnu Pretorius

NeurIPS 2025 (Oral) Nov 2025

Oryx: a Scalable Sequence Model for Many-Agent Coordination in Offline MARL

Claude Formanek | Omayma Mahjoub | Louay Ben Nessir | Sasha Abramowitz | Ruan de Kock | Wiem Khlifi | Daniel Rajaonarivonivelomanantsoa | Simon Du Toit | Arnol Fokam | Siddarth Singh | Ulrich Mbou Sob | Felix Chalumeau | Arnu Pretorius

NeurIPS 2025 Nov 2025

Are genomic language models all you need? Exploring genomic language models on protein downstream tasks

Sam Boshar | Evan Trop | Bernardo P. de Almeida | Liviu Copoiu | Thomas Pierrot

Bioinformatics (2024) Sep 2025
Motivation: Large language models, trained on enormous corpora of biological sequences, are state-of-the-art for downstream genomic and proteomic tasks. Since the genome contains the information to encode all proteins, genomic language models (gLMs) hold the potential to make downstream predictions not only about DNA sequences, but also about proteins. However, the performance of gLMs on protein tasks remains unknown, because few tasks pair proteins with the coding DNA sequences (CDS) that can be processed by gLMs. Results: In this work, we curated five such datasets and used them to evaluate the performance of gLMs and proteomic language models (pLMs). We show that gLMs are competitive with, and on some tasks even outperform, their pLM counterparts. The best performance was achieved using the retrieved CDS compared to sampling strategies. We found that training a joint genomic-proteomic model outperforms each individual approach, showing that they capture different but complementary sequence representations, as we demonstrate through model interpretation of their embeddings. Lastly, we explored different genomic tokenization schemes to improve downstream protein performance. We trained a new Nucleotide Transformer (50M) foundation model with 3mer tokenization that outperforms its 6mer counterpart on protein tasks while maintaining performance on genomics tasks. The application of gLMs to proteomics offers the potential to leverage rich CDS data, and in the spirit of the central dogma, the possibility of a unified and synergistic approach to genomics and proteomics.

Multi-Agent Reinforcement Learning with Selective State-Space Models

Jemma Daniel | Ruan John de Kock | Louay Ben Nessir | Sasha Abramowitz | Omayma Mahjoub | Wiem Khlifi | Juan Claude Formanek | Arnu Pretorius

AAMAS 2025 Sep 2025
The left-hand plot in Figure 1 compares MAM, MAT, and MAPPO, aggregated over all tasks and environments. MAM achieves performance on par with MAT, the current state-of-the-art, while learning faster.