The genome is not a linear string of information but a long-range, three-dimensional system in which regulatory signals act across large genomic distances to control gene expression [1]. Interpreting it requires models that can connect nucleotide-level sequence to long-range regulation, cell-state–specific expression programmes, and the phenotypic consequences of genetic variation.
Meeting this challenge demands single-nucleotide resolution, with precise knowledge of where genes begin and end, how regulatory elements are organised, and how even a single nucleotide substitution can alter gene expression.
Historically, these capabilities were addressed by separate modelling paradigms rather than a single shared framework [2]. Supervised models deliver high assay-specific accuracy but struggle to generalise; self-supervised language models learn broad genomic structure yet often lack direct functional prediction; and generative models enable sequence design but remain largely disconnected from genome-to-function inference.
InstaDeep’s latest innovation, Nucleotide Transformer v3 (NTv3), is a single framework that unites these once-separate capabilities. NTv3 learns representations, predicts functional readouts, annotates genomes and designs new sequences across multiple species. By reasoning across one million nucleotides at single-base resolution, it gains access to the long-range regulatory logic that shapes genomic function, moving the field from simply reading genes to engineering them.
Solving the Long-Context Problem
Many of the genome’s most consequential regulatory events unfold across long genomic distances. Enhancers, for example, can influence genes located hundreds of kilobases, or even a full megabase, away. Chromatin, the organised complex of DNA and proteins that packages the genome and regulates access to genetic information, plays a central role in this process by forming loops that bring distant regions into physical contact [3]. Capturing this behaviour requires models that combine fine local sensitivity with broad contextual awareness.
NTv3 addresses this challenge through a U-Net-inspired architecture adapted for genomic scale. The model begins by compressing raw nucleotide sequences using convolutional layers that preserve fine-scale features while reducing computational load. This compact representation is then processed by a Transformer core that integrates information across the entire megabase window. Finally, a symmetric upsampling pathway reconstructs predictions back to single-nucleotide resolution.

This design allows NTv3 to remain sensitive to distant regulatory signals while producing precise, per-base predictions. Despite its scale, the architecture is highly efficient: even the largest NTv3 model, with 650M parameters, delivers faster inference at full megabase context than specialised long-range architectures with fewer parameters, such as HyenaDNA and Caduceus [4].
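To see why compressing the sequence before the Transformer core matters, here is a back-of-the-envelope sketch in Python. The number of downsampling stages (`NUM_DOWN`) and the stride-2 factor are illustrative assumptions, not NTv3's published configuration, and the helper names are hypothetical. The sketch only tracks sequence lengths and the quadratic cost of full self-attention; it does not implement the model.

```python
def encoder_lengths(seq_len: int, num_down: int, factor: int = 2) -> list[int]:
    """Sequence length at the input and after each downsampling stage."""
    lengths = [seq_len]
    for _ in range(num_down):
        if lengths[-1] % factor != 0:
            raise ValueError("sequence length must divide evenly at every stage")
        lengths.append(lengths[-1] // factor)
    return lengths


def attention_pairs(n_tokens: int) -> int:
    """Number of query-key pairs in full self-attention over n_tokens."""
    return n_tokens * n_tokens


SEQ_LEN = 1_000_000  # one-megabase window, as described in the text
NUM_DOWN = 6         # ASSUMPTION: six stride-2 stages, chosen for illustration

down = encoder_lengths(SEQ_LEN, NUM_DOWN)  # convolutional compression path
up = down[::-1]                            # symmetric upsampling path
core_tokens = down[-1]                     # tokens seen by the Transformer core
saving = attention_pairs(SEQ_LEN) // attention_pairs(core_tokens)

print("encoder lengths:", down)
print("decoder lengths:", up)
print(f"attention pairs reduced {saving:,}x")
```

Under these assumed settings the Transformer core sees 15,625 tokens rather than 1,000,000, which cuts the number of attention pairs by a factor of 4,096, while the mirrored upsampling path returns to one prediction per nucleotide.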
💡 An enhancer is a regulatory DNA element that increases the expression of a target gene, often acting over long genomic distances by interacting with the gene’s promoter [5].

A Novel Training Approach
NTv3 demonstrates that carefully aligned architecture and training objectives can outperform sheer model scale in capturing biological function.
The model is trained in two major stages. The first is a genome-scale pre-training phase using more than eight trillion nucleotides. A length curriculum gradually expands the maximum input window from kilobases to a full megabase, enabling NTv3 to learn stable and coherent representations across sequence scales and to acquire a broad understanding of genomic organisation spanning all domains of life.
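A length curriculum of this kind can be sketched as a simple schedule of maximum window sizes. The starting length, the doubling factor and the function name below are assumptions made for illustration; the text only states that the window grows from kilobases to a full megabase.

```python
def length_curriculum(start_len: int, max_len: int, growth: int = 2) -> list[int]:
    """Maximum input window per stage, growing geometrically up to max_len."""
    if start_len <= 0 or max_len < start_len or growth < 2:
        raise ValueError("need 0 < start_len <= max_len and growth >= 2")
    windows = []
    current = start_len
    while current < max_len:
        windows.append(current)
        current *= growth
    windows.append(max_len)  # final stage always trains at the full window
    return windows


# ASSUMPTION: start at 1 kb and double each stage; both values are illustrative.
schedule = length_curriculum(start_len=1_000, max_len=1_000_000)
for stage, window in enumerate(schedule):
    print(f"stage {stage}: max window = {window:,} nt")
```

A geometric schedule spends the early stages on short, cheap sequences and reserves the expensive megabase windows for the end, which is one plausible reading of "gradually expands"; the actual schedule may differ.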
The second training stage introduces biological supervision from two complementary data types. Around 16,000 functional tracks are combined with genome annotations from 24 species, allowing the model to learn how sequence relates to real molecular activity. Species-conditional normalisation provides information about the organism being analysed, enabling a single set of weights to adapt to distinct regulatory patterns across species. This joint objective teaches NTv3 to connect raw sequence to both molecular phenotype and structural genome annotation, supporting multiple tasks without architectural modification.
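One common way to let a single set of weights adapt to different organisms is conditional layer normalisation, where the shared activations are normalised and then rescaled with species-specific parameters. The sketch below illustrates that general idea in plain Python; it is not a description of NTv3's actual layer, and the species keys and parameter values are hypothetical placeholders.

```python
import math


def layer_norm(x: list[float], eps: float = 1e-5) -> list[float]:
    """Normalise a feature vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


# HYPOTHETICAL per-species scale and shift vectors; in a real model these
# would be learned, one pair per species seen during training.
SPECIES_PARAMS = {
    "species_a": {"scale": [1.0, 1.0, 1.0, 1.0], "shift": [0.0, 0.0, 0.0, 0.0]},
    "species_b": {"scale": [0.5, 2.0, 1.0, 1.5], "shift": [0.1, -0.1, 0.0, 0.2]},
}


def species_conditional_norm(x: list[float], species: str) -> list[float]:
    """Shared normalisation followed by a species-specific affine transform."""
    params = SPECIES_PARAMS[species]
    normed = layer_norm(x)
    return [g * v + b for g, v, b in zip(params["scale"], normed, params["shift"])]


features = [1.0, 2.0, 3.0, 4.0]
out_a = species_conditional_norm(features, "species_a")
out_b = species_conditional_norm(features, "species_b")
print(out_a)
print(out_b)
```

The same input features produce different outputs for the two species, while everything upstream of the normalisation layer stays shared.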
Together, these training stages equip NTv3 to perform functional prediction, genome annotation and representation learning within a single backbone. The model generalises strongly across mammals, invertebrates and major crops, while remaining readily adaptable to species never encountered during training. Importantly, this post-training strategy preserves compatibility with masked language modelling (MLM), allowing the same backbone to be fine-tuned into a controllable generative model without architectural changes.

State-of-the-Art Performance and Efficiency
NTv3 highlights the potential of efficient genome-scale modelling. The smallest variant, with just eight million parameters, performs competitively against far larger models and outperforms both the one-billion-parameter Evo2 and our previous 500-million-parameter NTv2.
On the new publicly available NTv3 Benchmark, which spans 106 tasks across seven species, NTv3 achieves state-of-the-art accuracy in both functional track prediction and genome annotation. On additional published benchmarks, it reproduces complex assay signals with high fidelity and identifies gene-structure elements with precision, outperforming established tools such as our SegmentNT, as well as AUGUSTUS and SpliceAI. The model also performs strongly in cross-species settings, achieving high accuracy on held-out species such as cattle and tomato despite no exposure to their data during training.
NTv3 further excels on gene-level prediction tasks central to agricultural genomics. When predicting gene expression and protein abundance from promoter-proximal sequence, it not only outperforms other foundation models but also surpasses our domain-specialised one-billion-parameter model, AgroNT. Performance consistently improves with longer context, highlighting the importance of capturing distal regulatory effects that shorter-context models overlook, while reinforcing NTv3’s effectiveness as a generalist model.

Mechanistic Interpretation
Crucially, NTv3’s generative capability is grounded in mechanistic understanding. Rather than merely recognising statistical patterns, NTv3 recovers biologically meaningful regulatory relationships. It identifies enhancer–promoter interactions consistent with experimental characterisation, including the HS1–HS5 hierarchy at the HBE1 locus. It explains the impact of genetic variants by attributing changes to motif disruption, motif creation, or broader regulatory rewiring, and distinguishes pathogenic eQTLs from benign variants through systematic differences in attribution patterns. In more complex cases, such as pathogenic splice variants, NTv3 integrates predictions, annotation masks, saliency and attention maps to reconstruct coherent, biologically plausible mechanisms of action.
💡 The HS1–HS5 hierarchy is a structured set of regulatory elements that collectively control when and how globin genes are activated across development. In NTv3, this regulatory organisation is recovered at the locus of HBE1, a developmentally regulated globin gene whose expression depends on long-range enhancer control [6].
💡 Pathogenic eQTLs are genetic variants (expression quantitative trait loci) that influence gene expression levels and cause, or significantly increase the risk of developing, a specific disease or disorder [7].
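The idea of attributing a variant's effect to motif disruption or motif creation can be illustrated with in silico mutagenesis: score the reference sequence, score the mutated sequence, and compare. In the sketch below, `toy_activity` is a hypothetical stand-in for a model prediction that simply counts occurrences of one motif; the motif, the sequences and all function names are assumptions for illustration, not NTv3's attribution method.

```python
MOTIF = "TATAAA"  # ASSUMPTION: a single toy motif, chosen for illustration


def toy_activity(seq: str) -> int:
    """Stand-in for a model prediction: number of motif hits in seq."""
    return sum(seq.startswith(MOTIF, i) for i in range(len(seq) - len(MOTIF) + 1))


def classify_variant(ref_seq: str, pos: int, alt: str) -> tuple[int, str]:
    """Return (predicted change, label) for a single-nucleotide substitution."""
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    delta = toy_activity(alt_seq) - toy_activity(ref_seq)
    if delta < 0:
        label = "motif disruption"
    elif delta > 0:
        label = "motif creation"
    else:
        label = "no motif change"
    return delta, label


def saturation_mutagenesis(ref_seq: str) -> dict[tuple[int, str], int]:
    """Predicted change for every possible single-nucleotide substitution."""
    effects = {}
    for pos, ref_base in enumerate(ref_seq):
        for alt in "ACGT":
            if alt != ref_base:
                effects[(pos, alt)] = classify_variant(ref_seq, pos, alt)[0]
    return effects


disrupted = classify_variant("GGCTATAAAGGC", 5, "G")  # breaks the existing motif
created = classify_variant("GGCTATCAAGGC", 6, "A")    # completes a new motif
neutral = classify_variant("GGCTATAAAGGC", 0, "A")    # outside the motif
effects = saturation_mutagenesis("GGCTATAAAGGC")
print(disrupted, created, neutral)
```

With a real model, the toy scorer would be replaced by the predicted functional track, and the per-position effects from `saturation_mutagenesis` would form an attribution map over the sequence.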
DNA Design
Building on this understanding, NTv3 marks a shift from sequence modelling to design. Using masked-diffusion language modelling, the model can be adapted into a generative framework capable of producing new DNA sequences that satisfy detailed functional constraints. This establishes a direct bridge between genome understanding and genome engineering within a single, unified framework.
To test its capabilities for practical sequence design, NTv3 was used to design enhancer sequences with specified activity levels and promoter selectivity. Generated sequences were tested experimentally using STARR-seq, a lab technique that tests whether a DNA sequence functions as an enhancer by measuring how much gene expression it produces in cells [8]. Two specific properties were tested:
- Activity-stratified enhancer design: NTv3 produced enhancer sequences targeting specific activity levels. In silico predictions closely matched held-out experimental distributions, while in vitro assays confirmed clear stratification across activity tiers.
- Promoter-specific enhancer design: By conditioning generation on promoter context, NTv3 produced enhancers that activate strongly with one promoter while remaining weak with another. These sequences showed greater promoter specificity than those identified through oracle-guided filtering, indicating that NTv3 captures promoter–enhancer compatibility rules.
Together, these results demonstrate that NTv3 leverages its internal regulatory understanding to generate sequences that withstand experimental validation, an essential step towards practical genome engineering.
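The general shape of masked-diffusion generation is an iterative unmasking loop: start from a fully masked sequence and, over a fixed number of steps, fill in batches of positions by sampling from the model's predicted base distribution. The sketch below shows only that loop. `toy_denoiser` is a hypothetical placeholder that ignores sequence context and uses a target GC fraction as a stand-in for a real conditioning signal such as desired enhancer activity; none of these names or choices come from the NTv3 release.

```python
import math
import random

MASK = "N"      # ASSUMPTION: placeholder symbol for a masked position
BASES = "ACGT"


def toy_denoiser(seq: list[str], pos: int, gc_target: float) -> list[float]:
    """HYPOTHETICAL model stand-in: returns P(A), P(C), P(G), P(T) at pos.

    A real model would read the partially unmasked context in seq; this toy
    version ignores it and depends only on the conditioning value gc_target.
    """
    at = (1.0 - gc_target) / 2
    gc = gc_target / 2
    return [at, gc, gc, at]


def generate(length: int, steps: int, gc_target: float, seed: int = 0) -> str:
    """Unmask a fully masked sequence over a fixed number of steps."""
    rng = random.Random(seed)
    seq = [MASK] * length
    masked = list(range(length))
    rng.shuffle(masked)  # random unmasking order
    for step in range(steps):
        # Spread the remaining masked positions evenly over the remaining steps.
        n_unmask = math.ceil(len(masked) / (steps - step))
        for _ in range(n_unmask):
            pos = masked.pop()
            probs = toy_denoiser(seq, pos, gc_target)
            seq[pos] = rng.choices(BASES, weights=probs, k=1)[0]
    return "".join(seq)


designed = generate(length=50, steps=5, gc_target=1.0)
print(designed)
```

In a real sampler the denoiser's predictions would change as more context is revealed, which is what lets later positions stay consistent with earlier ones and with the requested functional constraint.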

What Comes Next
NTv3 contributes to a future in which genomic interpretation and genome engineering are no longer separate disciplines. In support of this vision, and continuing our longstanding commitment to open-source AI, we have released the full NTv3 model family alongside a suite of practical resources. These include pre-trained and post-trained checkpoints, as well as PyTorch notebooks covering long-context inference, functional-track prediction, genome annotation, variant analysis and guided sequence generation for enhancer design.
Beyond the immediate release, the broader significance of this work lies in its potential applications. NTv3 represents a step towards a new era of computational biology that supports personalised medicine, agricultural innovation and synthetic biology, one in which understanding and engineering evolve together.
🚀 Ready to discover NTv3? Explore the model checkpoints on Hugging Face, access the code on GitHub, and read the paper today!
Disclaimer: All claims made are supported by our research paper, “A foundational model for joint sequence–function multi-species modeling at scale for long-range genomic prediction”, unless explicitly cited otherwise.
1. Boshar, S. et al. A foundational model for joint sequence–function multi-species modeling at scale for long-range genomic prediction. bioRxiv (2025). https://bit.ly/4kNgHwO, p. 1
2. Boshar, S. et al. A foundational model for joint sequence–function multi-species modeling at scale for long-range genomic prediction. bioRxiv (2025). https://bit.ly/3MGPnn0, p. 2
3. National Cancer Institute. Chromatin. In NCI Dictionary of Cancer Terms, U.S. Department of Health and Human Services. Available from: https://bit.ly/3OkG24Y
4. Boshar, S. et al. A foundational model for joint sequence–function multi-species modeling at scale for long-range genomic prediction. bioRxiv (2025). https://bit.ly/40ibLq0, p. 4
5. Balasubramanian, P. et al. Enhancer–promoter interactions form independently of genomic distance and are functional across topological domain boundaries. bioRxiv (2022). Available from: https://bit.ly/4rVF9OE
6. HBB-LCR beta-globin locus control region [Homo sapiens]. NCBI Gene. Gene ID: 109580095. Available from: https://bit.ly/4rRv0Cx
7. Jia, Z. et al. eQTL analysis: A bridge from genome to mechanism. Genes & Diseases (in press, 2025). https://bit.ly/46fJ1BQ
8. Muerdter, F. et al. STARR-seq – Principles and applications. Genomics 106, 145–150 (2015). https://bit.ly/46be6qh
