A foundation model for joint sequence–function, multi-species modeling at scale for long-range genomic prediction

Sam Boshar | Benjamin Evans | Ziqi Tang | Armand Picard | Yanis Adel | Franziska K. Lorbeer 1 | Chandana Rajesh | Tristan Karch | Shawn Sidbon | David Emms | Javier Mendoza-Revilla | Fatimah Al-Ani | Evan Seitz | Yair Schiff 3 | Yohan Bornachot | Ariana Hernandez | Marie Lopez | Alexandre Laterre | Karim Beguir | Peter Koo 4 | Volodymyr Kuleshov 3 | Alexander Stark 1,2 | Bernardo P. de Almeida | Thomas Pierrot

1 Research Institute of Molecular Pathology (IMP), Vienna BioCenter (VBC), Vienna, Austria | 2 Medical University of Vienna, Vienna BioCenter (VBC), Vienna, Austria | 3 Cornell Tech, New York, USA | 4 Cold Spring Harbor Laboratory, New York, USA


ABSTRACT

Genomic prediction and design require models that integrate local sequence features with long-range regulatory dependencies spanning hundreds of kilobases to megabases. Existing approaches have made substantial progress along complementary axes: supervised sequence-to-function models achieve high accuracy for specific assays and organisms, self-supervised genomic foundation models learn transferable representations from large-scale sequence data, and conditional generative models enable principled sequence design guided by functional objectives. However, these strengths are typically realized in isolation—across distinct model classes, architectures, and training regimes—limiting the ability to combine long-context, base-resolution prediction, functional modeling, and controllable generation within a single efficient framework that generalizes across organisms and modalities.

Here we introduce Nucleotide Transformer v3 (NTv3), a multi-species foundation model that unifies representation learning, functional-track and genome-annotation prediction, and controllable sequence generation within a common backbone. NTv3 uses a U-Net–like architecture to enable single-base tokenization and efficient modeling of contexts up to 1 Mb. We pretrain NTv3 on 9 trillion base pairs from OpenGenome2 using base-resolution masked language modeling, followed by post-training with a joint objective that integrates continued self-supervision with supervised learning on ∼16,000 functional tracks and annotation labels from 24 animal and plant species. After post-training, NTv3 achieves state-of-the-art accuracy for functional-track prediction and genome annotation across species, outperforming leading sequence-to-function and foundation-model baselines on established benchmarks and on the new NTv3 Benchmark, a controlled downstream fine-tuning suite in a standardized 32 kb input / base-resolution output setting. We further show that NTv3 consolidates a shared regulatory grammar across tasks, enabling coherent long-range genome-to-function inference and variant-associated remodeling. Finally, we fine-tune NTv3 into a controllable generative model via masked diffusion language modeling and use it to design enhancer sequences with specified activity levels and promoter selectivity. We validate these designs experimentally by STARR-seq, showing that generated enhancers recapitulate the intended activity stratification and achieve the desired promoter-specific activation in cellulo. We release the NTv3 model family together with code and practical cookbooks for long-context training, multi-species post-training, fine-tuning, interpretation, and sequence design.
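To make the pretraining objective concrete, the sketch below illustrates base-resolution masked language modeling, in which every nucleotide is its own token and a random subset of positions is masked for the model to reconstruct. This is a minimal, hypothetical illustration written for this summary, not the authors' implementation; the vocabulary, mask rate, and `mask_sequence` helper are assumptions for demonstration only.

```python
import random

# Hypothetical sketch of base-resolution masked language modeling:
# each nucleotide is a single token; a fraction of positions is
# replaced by [MASK], and the model is trained to recover the
# original base at exactly those positions.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "[MASK]": 4}
IGNORE = -100  # conventional "ignore this position in the loss" label

def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Return (input_ids, target_ids) for one DNA sequence.

    Masked positions carry the [MASK] id in the input and the true
    base id in the target; unmasked positions get IGNORE so the loss
    is computed only where the model must reconstruct the sequence.
    """
    rng = random.Random(seed)
    input_ids, target_ids = [], []
    for base in seq:
        tok = VOCAB[base]
        if rng.random() < mask_rate:
            input_ids.append(VOCAB["[MASK]"])
            target_ids.append(tok)      # predict the original base here
        else:
            input_ids.append(tok)
            target_ids.append(IGNORE)   # no loss at visible positions
    return input_ids, target_ids

inputs, targets = mask_sequence("ACGTACGTACGT", mask_rate=0.5)
```

At NTv3's scale the same objective is applied over windows of up to 1 Mb, so single-base tokenization makes the efficiency of the backbone (here, the U-Net-like downsampling/upsampling) essential.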