Nucleotides are the fundamental units of DNA, and when linked together by a sugar-phosphate backbone, they form the strands that define our genome.
Analysing the precise role of each nucleotide within these sequences is essential to understanding their influence on gene regulation and disease. However, the human genome contains around 3 billion nucleotides in a single set of chromosomes, making interpretation extraordinarily challenging.
Effective interpretation hinges on knowing exactly where each feature lies in the sequence: the order of A’s, T’s, C’s, and G’s; where a gene begins; and where regulatory signals are located, all at single-nucleotide resolution.
Sequence-based machine learning models trained on large-scale genomics data have shown impressive performance in capturing patterns and predicting molecular phenotypes. However, most remain narrow specialists, each optimised for a single task, such as detecting splice sites or promoters, rather than several at once.
SegmentNT aims to close this gap as a versatile foundation model for high-resolution, multi-task genome annotation.
What is SegmentNT?
As seen in Nature Methods, SegmentNT is an extension of our established Nucleotide Transformer (NT) models, optimised for genomic analysis at single-nucleotide resolution.
Capable of handling sequences up to 50kb, SegmentNT can locate 14 distinct classes of genomic and regulatory elements at once. This includes key gene components such as exons, introns, and splice sites, enabling the model to infer changes to transcript isoforms and assess the potential impact of mutations.
💡1 kilobase (kb) = 1,000 nucleotides. A 50kb sequence contains 50,000 nucleotides.
Our researchers combined a pre-trained NT foundation model encoder with a specialised neural network known as a 1D U-Net to create SegmentNT. The model is trained end-to-end using focal loss, a loss function that helps it prioritise rare or hard-to-classify elements by down-weighting examples it already classifies confidently. This is particularly important in genomics, where features like splice sites or polyA signals are relatively scarce. The encoder processes DNA by breaking it into non-overlapping chunks of six nucleotides (6-mer tokens) and extracting meaningful patterns from each. As a result, SegmentNT can detect genomic features at multiple scales across entire DNA sequences in a single pass.
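To make the idea of focal loss concrete, here is a minimal NumPy sketch for independent binary labels (one label per nucleotide and element type). It is an illustration under stated assumptions, not SegmentNT's training code: the function name, the `gamma=2.0` default, and the binary formulation are all choices made for this example.

```python
import numpy as np

def binary_focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean focal loss for independent 0/1 labels (illustrative sketch).

    probs:   predicted probabilities in [0, 1]
    targets: 0/1 labels with the same shape as probs
    gamma:   focusing strength; gamma=0 gives plain cross-entropy
             (2.0 is an assumed default, not a value from the article)
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    targets = np.asarray(targets, dtype=float)
    # p_t is the probability the model assigned to the true label
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    # (1 - p_t)^gamma shrinks the loss of confidently correct predictions
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

easy = binary_focal_loss([0.95], [1])  # confidently correct prediction
hard = binary_focal_loss([0.30], [1])  # poorly predicted positive
```

With these inputs the easy example contributes far less to the loss than the hard one, which is the mechanism that keeps abundant, easily classified positions from drowning out rare elements.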

Performance
We benchmarked SegmentNT against several widely used bioinformatics tools to evaluate its performance across individual annotation tasks.
For splice site detection, SegmentNT outperformed SpliceAI, a specialised deep learning model, achieving a Matthews correlation coefficient (MCC) across the genome of 0.75 for splice acceptor and 0.76 for donor sites, compared to SpliceAI’s 0.67 and 0.59. On promoter and enhancer prediction, SegmentNT exceeded approaches that apply binary classifiers (such as DeePromoter) in a sliding-window setting to derive nucleotide-level predictions. SegmentNT also showed strong capabilities for gene component identification, particularly on exon-intron structures, outperforming classical gene prediction tools like Augustus on held-out test sets.
However, unlike these task-specific tools, SegmentNT was designed to detect all 14 genomic and regulatory element types simultaneously. To evaluate its performance in this broader setting and assess the importance of individual components, we trained two 1D U-Nets without the pre-trained NT encoder:
- A 63M parameter version, which achieved an MCC of just 0.07
- A larger 252M parameter version, which reached 0.10
We also tested a SegmentNT variant with a randomly initialised NT encoder. Even after training on 34 billion tokens, it plateaued at 0.15, highlighting the value of integrating a pre-trained DNA encoder into the architecture. In contrast, SegmentNT achieved an average MCC of 0.38 on 3kb sequences.
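💡 MCC (Matthews correlation coefficient) measures how well a model’s predictions match real biological annotations. It ranges from -1 to 1: a score of 1 indicates perfect prediction, 0 is no better than chance, and -1 indicates complete disagreement.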
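For readers who want to see how such a score is computed, the sketch below implements the standard Matthews correlation coefficient for binary labels in plain Python. The function name and the toy labels are illustrative only; returning 0.0 when the denominator is zero is a common convention assumed here.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: score is 0 when any marginal count is zero
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy example: one of two annotated positions is missed
score = mcc([1, 1, 0, 0], [1, 0, 0, 0])
```

Because the score uses all four cells of the confusion matrix, it stays informative when positives are rare, as is typical for nucleotide-level annotation.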
When compared to other deep learning approaches, SegmentNT-30kb was over 300 times faster than binary classifiers (DeePromoter and NTV2) applied in a sliding window across the full sequence. For a 30kb input, it produces 420,000 predictions (30,000 nucleotides × 14 element types), assigning a probability to each nucleotide for each element type, in just 0.009 seconds. These results support SegmentNT’s capability to deliver both precision and scalability across large genomic regions.
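The prediction count follows directly from the output layout: one probability per nucleotide per element type. The short sketch below reproduces that arithmetic; the function and variable names are hypothetical, and the throughput figure is simply derived from the 0.009-second timing quoted above.

```python
NUM_ELEMENT_TYPES = 14  # element classes reported in the article

def num_predictions(sequence_length_nt, num_element_types=NUM_ELEMENT_TYPES):
    """One probability per nucleotide per element type."""
    return sequence_length_nt * num_element_types

predictions_30kb = num_predictions(30_000)  # 30,000 x 14 = 420,000
# Implied throughput, using the reported 0.009 s per 30kb input
predictions_per_second = predictions_30kb / 0.009
```

At the quoted timing this works out to roughly 47 million nucleotide-level predictions per second for a single 30kb input.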

Applications
SegmentNT is designed with real-world genomics research in mind. To explore its broader utility, we evaluated its performance across longer sequences, splice site contexts, and species diversity.
Extending sequence context
The original NT was trained with rotary positional embeddings (RoPE) for inputs of up to 12kb. By rescaling the RoPE frequencies, SegmentNT can handle much longer sequences. As the context increased from 3kb to 30kb, average MCC rose from 0.38 to 0.46. When we applied SegmentNT-10kb to 100kb sequences, applying this rescaling raised MCC from 0.07 to 0.26. SegmentNT-30kb performed best, peaking at 0.47 on 50kb sequences (equivalent to 700,000 nucleotide-level predictions per sequence) and holding 0.45 at 100kb.
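The article does not spell out the rescaling formula, so the following is only a sketch of one common scheme, NTK-aware base rescaling, which may differ from what SegmentNT actually uses. The function name, the `rescaling_factor` parameter, and the example dimensions are assumptions made for illustration.

```python
import numpy as np

def rope_angles(num_positions, head_dim, base=10000.0, rescaling_factor=1.0):
    """Rotation angles for RoPE with optional NTK-aware base rescaling.

    rescaling_factor is assumed to be (target length / training length);
    a value of 1.0 gives standard RoPE.
    """
    if head_dim % 2 != 0 or head_dim < 4:
        raise ValueError("head_dim must be an even number >= 4")
    # Enlarging the base stretches the low-frequency components so that
    # longer sequences stay within the angle range seen during training
    scaled_base = base * rescaling_factor ** (head_dim / (head_dim - 2))
    inv_freq = scaled_base ** (-np.arange(0, head_dim, 2) / head_dim)
    positions = np.arange(num_positions)
    return np.outer(positions, inv_freq)  # shape: (num_positions, head_dim // 2)

standard = rope_angles(8, 64)
extended = rope_angles(8, 64, rescaling_factor=2.5)  # e.g. 12kb -> 30kb
```

In this scheme the highest-frequency component is left untouched, preserving local resolution, while the lowest-frequency component is slowed by exactly the rescaling factor.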
To assess how different DNA encoders influence segmentation performance, we compared SegmentNT to modified architectures that use representations from Enformer and Borzoi. These long-range supervised models were originally trained on epigenomic and transcriptomic data. We incorporated their final-layer outputs into the same segmentation framework, resulting in SegmentEnformer and SegmentBorzoi.
Using 30kb inputs, SegmentNT outperformed both alternatives across most genomic features. It achieved a higher average MCC (0.45) compared to SegmentEnformer (0.34) and SegmentBorzoi (0.35). The advantage was particularly clear for fine-scale elements like splice sites and polyA signals, where single-nucleotide resolution is critical.
Both alternative models showed improved results when run at their longer native input lengths, 196kb for SegmentEnformer and 524kb for SegmentBorzoi, particularly for extended genomic regions such as introns, lncRNAs, and untranslated regions. However, their overall accuracy remained lower than SegmentNT’s across all tested element types.
Splice sites
We tested SegmentNT’s splice site detection ability against SpliceAI using its original benchmark dataset, where SegmentNT-30kb performed comparably. However, when evaluated on a broader whole-genome dataset, including intergenic regions and reverse-strand genes, SegmentNT demonstrated a clear advantage, achieving an MCC of 0.75 (acceptor) and 0.76 (donor), compared to SpliceAI’s 0.67 and 0.59.
SegmentNT also predicts exons and introns directly, enabling accurate reconstruction of transcript isoforms and supporting downstream analyses such as variant interpretation and gene expression modelling.
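As a rough illustration of how per-nucleotide outputs could feed downstream analyses, the sketch below turns a track of probabilities for one element type (for example, exon) into contiguous intervals. The function name, the 0.5 threshold, and the toy probabilities are hypothetical; this is not SegmentNT's own post-processing.

```python
import numpy as np

def probabilities_to_segments(probs, threshold=0.5):
    """Convert per-nucleotide probabilities into half-open (start, end) intervals."""
    mask = np.asarray(probs) >= threshold
    # Pad with False so segments touching either end are closed properly
    padded = np.concatenate(([False], mask, [False]))
    # Indices where the mask flips: even entries are starts, odd entries are ends
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(s), int(e)) for s, e in zip(changes[0::2], changes[1::2])]

exon_probs = [0.1, 0.9, 0.8, 0.2, 0.7, 0.6, 0.9, 0.1]
segments = probabilities_to_segments(exon_probs)  # [(1, 3), (4, 7)]
```

Intervals are 0-based and half-open, so `(1, 3)` covers positions 1 and 2. A real pipeline would likely add smoothing or minimum-length filters before assembling exons and introns into transcript isoforms.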
Multi-species zero-shot generalisation
Trained initially on human annotations, SegmentNT demonstrated strong zero-shot generalisation to conserved features across a range of species. To extend this capability, a multispecies variant, SegmentNT-30kb-multispecies, was developed by fine-tuning on genic elements from a diverse set of vertebrate and invertebrate organisms.
This fine-tuned model improved generalisation to evolutionarily distant species, increasing the average MCC from 0.49 to 0.57 for those with more than 100 million years of divergence. For species more closely related to humans, it also achieved slightly higher accuracy, with an average MCC of 0.64 compared to 0.62.
Although trained exclusively on animal genomes, the multispecies model performed well on several held-out plant species, improving the average MCC from 0.34 to 0.45. This suggests that the model captures genomic representations that are transferable across diverse genome structures.

What’s next?
SegmentNT could open the door to flexible, high-resolution genome interpretation across species, tasks, and contexts. From isoform prediction and variant impact assessment to multispecies annotation, it offers the possibility to bring scalable genomic insight closer to reality.[1]
Dive into SegmentNT today! Download the paper and explore the model on GitHub and Hugging Face.
Disclaimer: All claims made are supported by our research paper, “Annotating the genome at single-nucleotide resolution with DNA foundation models”, unless explicitly cited otherwise.
[1] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment anything,” 2023.
