Research Papers

Bayes-PD: Exploring a Sequence to Binding Bayesian Neural Network model trained on Phage Display data

Ilann Amiaud-Plachy | Michael Blank | Oliver Bent | Sebastien Boyer

NeurIPS 2025 - Structured Probabilistic Inference & Generative Modeling Jan 2026
Figure 2: The phage display Poisson model.
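As a hedged illustration of the count model named in the caption (not the paper's code), the NumPy sketch below scores post-selection read counts under a Poisson likelihood whose rate rescales each variant's pre-selection frequency by a sequence-dependent enrichment factor. The names `poisson_log_likelihood`, `freq_pre`, and `depth` are illustrative assumptions.

```python
# Minimal sketch of a Poisson observation model for phage display counts:
# a per-variant enrichment factor (e.g. a neural-net output) rescales the
# pre-selection frequencies, and post-selection reads are Poisson draws.
import numpy as np

rng = np.random.default_rng(0)

def poisson_log_likelihood(counts_post, freq_pre, log_enrichment, depth):
    """Log-likelihood of post-selection counts under a Poisson model.

    counts_post: observed read counts per variant after selection
    freq_pre: pre-selection library frequencies (sum to 1)
    log_enrichment: per-variant log enrichment factor
    depth: total sequencing depth of the post-selection pool
    """
    rate = freq_pre * np.exp(log_enrichment)
    rate = depth * rate / rate.sum()  # renormalize to the sequenced pool
    # Poisson log-pmf, omitting the constant log(k!) term
    return np.sum(counts_post * np.log(rate) - rate)

# Toy check: 4 variants, one strongly enriched
freq_pre = np.array([0.25, 0.25, 0.25, 0.25])
log_enr = np.array([0.0, 0.0, 0.0, 2.0])
lam = 1000 * freq_pre * np.exp(log_enr) / (freq_pre * np.exp(log_enr)).sum()
counts = rng.poisson(lam)
print(poisson_log_likelihood(counts, freq_pre, log_enr, depth=1000))
```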

GeoGraph: Geometric and Graph-based Ensemble Descriptors for Intrinsically Disordered Proteins

Eoin Quinn | Marco Carobene | Jean Quentin | Sebastien Boyer | Miguel Arbesú | Oliver Bent

NeurIPS 2025 Workshops Jan 2026
GeoGraph is a simulation-informed surrogate trained to predict ensemble-averaged statistics of residue–residue contact-map topology directly from sequence.
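As a rough sketch of the kind of target such a surrogate regresses (an assumption about the setup, not the paper's code), the snippet below computes an ensemble-averaged contact map and one simple topology statistic, the mean contact-graph degree, from toy Cα coordinates; the 8 Å cutoff and helper names are illustrative.

```python
# Sketch: ensemble-averaged residue-residue contact statistics.
import numpy as np

def contact_map(coords, cutoff=8.0):
    """Binary contact map from Calpha coordinates of shape (N, 3)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    cmap = (d < cutoff).astype(float)
    np.fill_diagonal(cmap, 0.0)  # a residue is not its own contact
    return cmap

def ensemble_contact_stats(ensemble, cutoff=8.0):
    """Average contact map and mean contact-graph degree over conformers."""
    cmaps = np.stack([contact_map(c, cutoff) for c in ensemble])
    avg_map = cmaps.mean(axis=0)
    mean_degree = cmaps.sum(axis=2).mean()  # avg contacts per residue
    return avg_map, mean_degree

# Toy ensemble: 10 conformations of a 50-residue chain
rng = np.random.default_rng(0)
ensemble = rng.normal(scale=10.0, size=(10, 50, 3))
avg_map, deg = ensemble_contact_stats(ensemble)
print(avg_map.shape, round(deg, 2))
```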

Annotating the genome at single-nucleotide resolution with DNA foundation models

Bernardo P. de Almeida | Hugo Dalla-Torre | Guillaume Richard | Christopher Blum | Lorenz Hexemer | Maxence Gélard | Javier Mendoza-Revilla | Ziqi Tang | Frederikke I. Marin | David M. Emms | Priyanka Pandey | Stefan Laurent | Marie Lopez | Alexandre Laterre | Maren Lang | Uğur Şahin | Karim Beguir | Thomas Pierrot

Nature Methods (2025) Jan 2026
The SegmentNT neural network architecture consists of a pre-trained DNA encoder (here the Nucleotide Transformer (NT)) and a segmentation head (here a U-Net).
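The encoder-plus-segmentation-head pattern can be sketched in a few lines of PyTorch. The toy `DNAEncoder` below merely stands in for the pre-trained Nucleotide Transformer, and the one-level U-Net and class count are illustrative assumptions, not SegmentNT's actual configuration.

```python
# Sketch of the pattern: pre-trained DNA encoder -> 1-D U-Net head that
# emits per-nucleotide class logits (e.g. one channel per genomic element).
import torch
import torch.nn as nn

class DNAEncoder(nn.Module):
    """Stand-in encoder: token ids -> per-position features."""
    def __init__(self, vocab=6, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
    def forward(self, tokens):           # (B, L) -> (B, dim, L)
        return self.embed(tokens).transpose(1, 2)

class TinyUNet1D(nn.Module):
    """One-level 1-D U-Net head with a skip connection."""
    def __init__(self, dim=64, n_classes=14):
        super().__init__()
        self.down = nn.Conv1d(dim, 2 * dim, 3, stride=2, padding=1)
        self.up = nn.ConvTranspose1d(2 * dim, dim, 4, stride=2, padding=1)
        self.out = nn.Conv1d(2 * dim, n_classes, 1)  # after skip concat
    def forward(self, x):                # (B, dim, L)
        skip = x
        x = torch.relu(self.down(x))
        x = torch.relu(self.up(x))
        return self.out(torch.cat([x, skip], dim=1))  # (B, n_classes, L)

encoder, head = DNAEncoder(), TinyUNet1D()
tokens = torch.randint(0, 6, (2, 128))   # toy tokenized DNA batch
logits = head(encoder(tokens))           # per-nucleotide logits
print(logits.shape)                      # torch.Size([2, 14, 128])
```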

A foundational model for joint sequence-function multi-species modeling at scale for long-range genomic prediction

Sam Boshar | Benjamin Evans | Ziqi Tang | Armand Picard | Yanis Adel | Franziska K. Lorbeer | Chandana Rajesh | Tristan Karch | Shawn Sidbon | David Emms | Javier Mendoza-Revilla | Fatimah Al-Ani | Evan Seitz | Yair Schiff | Yohan Bornachot | Ariana Hernandez | Marie Lopez | Alexandre Laterre | Karim Beguir | Peter Koo | Volodymyr Kuleshov | Alexander Stark | Bernardo P. de Almeida | Thomas Pierrot

Dec 2025
NTv3 is InstaDeep’s new multi-species genomics foundation model, designed for single-nucleotide-resolution prediction over 1 Mb of sequence and for bridging representation learning, sequence-to-function modelling, and generative regulatory design within a single framework.

Assessing data size requirements for training generalizable sequence-based TCR specificity models via pan-allelic MHC-I point-mutation ligandome evaluation

Antoine Delaunay | Miles McGibbon | Bachir Djermani | Nikolai Gorbushin | Sergio Chaves García-Mascaraque | Isaac Rayment | Ilya Kizhvatov | Cécile Petit | Maren Lang | Karim Beguir | Ugur Sahin | Liviu Copoiu | Nicolas Lopez Carranza | Andrey Tovchigrechko

Scientific Reports Dec 2025

MEMENTO: Memory-Enhanced Neural Solvers for Routing Problems

Felix Chalumeau | Refiloe Shabe | Noah De Nicola | Arnu Pretorius | Thomas D. Barrett | Nathan Grinsztajn

NeurIPS 2025 (Spotlight) Nov 2025

Breaking the Performance Ceiling in Reinforcement Learning requires Inference Strategies

Felix Chalumeau | Daniel Rajaonarivonivelomanantsoa | Ruan de Kock | Claude Formanek | Sasha Abramowitz | Oumayma Mahjoub | Wiem Khlifi | Simon Du Toit | Louay Ben Nessir | Refiloe Shabe | Arnol Fokam | Siddarth Singh | Ulrich Mbou Sob | Arnu Pretorius

NeurIPS 2025 (Oral) Nov 2025

Oryx: a Scalable Sequence Model for Many-Agent Coordination in Offline MARL

Claude Formanek | Omayma Mahjoub | Louay Ben Nessir | Sasha Abramowitz | Ruan de Kock | Wiem Khlifi | Daniel Rajaonarivonivelomanantsoa | Simon Du Toit | Arnol Fokam | Siddarth Singh | Ulrich Mbou Sob | Felix Chalumeau | Arnu Pretorius

NeurIPS 2025 Nov 2025

Are genomic language models all you need? Exploring genomic language models on protein downstream tasks

Sam Boshar | Evan Trop | Bernardo P. de Almeida | Liviu Copoiu | Thomas Pierrot

Bioinformatics (2024) Sep 2025
Motivation: Large language models, trained on enormous corpora of biological sequences, are state-of-the-art for downstream genomic and proteomic tasks. Since the genome contains the information to encode all proteins, genomic language models (gLMs) hold the potential to make downstream predictions not only about DNA sequences, but also about proteins. However, the performance of gLMs on protein tasks remains unknown, because few tasks pair proteins with the coding DNA sequences (CDS) that gLMs can process.

Results: In this work, we curated five such datasets and used them to evaluate the performance of gLMs and proteomic language models (pLMs). We show that gLMs are competitive with, and on some tasks even outperform, their pLM counterparts. The best performance was achieved using the retrieved CDS rather than sampling strategies. We found that training a joint genomic-proteomic model outperforms each individual approach, showing that they capture different but complementary sequence representations, as we demonstrate through model interpretation of their embeddings. Lastly, we explored different genomic tokenization schemes to improve downstream protein performance. We trained a new Nucleotide Transformer (50M) foundation model with 3-mer tokenization that outperforms its 6-mer counterpart on protein tasks while maintaining performance on genomics tasks. The application of gLMs to proteomics offers the potential to leverage rich CDS data and, in the spirit of the central dogma, the possibility of a unified and synergistic approach to genomics and proteomics.
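One intuition for the 3-mer result is that non-overlapping 3-mer tokens align with codon boundaries in a CDS, while 6-mers straddle two codons. The sketch below contrasts the two schemes; it is an illustration only, not the Nucleotide Transformer's tokenizer, which also handles special tokens and ambiguous bases.

```python
# Illustrative non-overlapping k-mer tokenization of a coding sequence;
# any trailing bases shorter than k are dropped.
def kmer_tokenize(seq, k):
    """Split a DNA sequence into non-overlapping k-mer tokens."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

cds = "ATGGCCATTGTAATGGGCCGC"  # toy coding sequence (7 codons)
print(kmer_tokenize(cds, 3))   # 3-mers align with codon boundaries
print(kmer_tokenize(cds, 6))   # each 6-mer spans two codons
```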

Multi-Agent Reinforcement Learning with Selective State-Space Models

Jemma Daniel | Ruan John de Kock | Louay Ben Nessir | Sasha Abramowitz | Omayma Mahjoub | Wiem Khlifi | Juan Claude Formanek | Arnu Pretorius

AAMAS 2025 Sep 2025
The left-hand plot in Figure 1 compares MAM, MAT, and MAPPO, aggregated over all tasks and environments. MAM achieves performance on par with MAT, the current state-of-the-art, while learning faster.

Sable: a Performant, Efficient and Scalable Sequence Model for MARL

Omayma Mahjoub | Sasha Abramowitz | Ruan de Kock | Wiem Khlifi | Simon du Toit | Jemma Daniel | Louay Ben Nessir | Claude Formanek | Louise Beyers | Liam Clark | Arnu Pretorius

ICML 2025 Jul 2025
Figure 1. Performance, memory, and scaling properties of Sable compared to the Multi-Agent Transformer (MAT) (Wen et al., 2022), the previous state-of-the-art, aggregated over 45 cooperative MARL tasks. Left: Sable ranks best in 34 out of 45 tasks, outperforming all other MARL algorithms tested across 6 environments: RWARE, LBF, MABrax, SMAX, Connector, and MPE; MAT ranks best in 3 of 45. Middle: Sable exhibits superior throughput, processing up to 6.5 times more steps per second than MAT as we scale to 512 agents. Right: Sable scales efficiently to thousands of agents, maintaining stable performance while using GPU memory significantly more efficiently than MAT.

Bimodal masked language modeling for bulk RNA-seq and DNA methylation representation learning

Maxence Gélard | Hakim Benkirane | Thomas Pierrot | Guillaume Richard | Paul-Henry Cournède

ICML 2025 Workshop Jul 2025
Figure 1: MOJO pipeline. (a) Each modality is first tokenized using linear binning. (b) MOJO, whose core architecture mixes convolution and attention operations, is first pre-trained through bimodal masked language modeling. (c) Embeddings probed from MOJO are used to fine-tune a task-specific head tailored for cancer-type classification or survival analysis.
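Step (a)'s linear binning can be sketched as equal-width quantization of continuous modality values into integer token ids. The bin count and value range below are illustrative assumptions, not MOJO's actual configuration.

```python
# Sketch of linear-binning tokenization: continuous values (e.g. normalized
# expression or methylation beta values) -> integer bin tokens.
import numpy as np

def linear_bin_tokenize(values, n_bins=64, vmin=0.0, vmax=1.0):
    """Map continuous values in [vmin, vmax] to 0-based bin tokens."""
    edges = np.linspace(vmin, vmax, n_bins + 1)
    # np.digitize returns 1..n_bins for in-range values; shift to 0-based
    # and clip so out-of-range values land in the edge bins.
    return np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)

meth = np.array([0.02, 0.5, 0.97])   # toy methylation beta values
print(linear_bin_tokenize(meth))     # [ 1 32 62]
```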