New research from InstaDeep, NVIDIA and the Technical University of Munich beats expectations, provides new insights into genomics research

InstaDeep is pleased to announce a new collaboration with the Technical University of Munich and NVIDIA, using the UK-based Cambridge-1 supercomputer to train large language models (LLMs) on diverse genomic datasets to examine the impact of model scale and data diversity on downstream task performance.  

As part of the work, multiple foundation models for genomics were constructed, achieving state-of-the-art results across numerous prediction challenges. Tasks such as predicting enhancer and promoter sequences and transcription factor binding sites were studied; these will help in understanding how DNA is transcribed into RNA and translated into proteins.
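To make the setup concrete, here is a minimal sketch of how a DNA sequence can be turned into tokens for a language model. The fixed-length 6-mer scheme below is an illustrative assumption, not necessarily the vocabulary used in the study:

```python
# Minimal sketch: turning a DNA sequence into tokens for a genomics LLM.
# Assumption: non-overlapping 6-mer tokenization; the study's actual
# tokenizer and vocabulary may differ.

def tokenize_dna(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA string into non-overlapping k-mer tokens."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, k)]

tokens = tokenize_dna("ATGCGTACGTTAGCCA")
print(tokens)  # ['ATGCGT', 'ACGTTA'] (trailing bases shorter than k are dropped)
```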

These findings have exciting implications for the field of genomics, as they demonstrate that large language models can effectively generalise across a wide range of tasks. This is a significant advancement, as previous approaches required a specialised model for each task. LLMs trained on genomics data can greatly simplify the process of predicting genomic features from DNA sequences, even in low-data settings, and can aid understanding of the biological consequences of human mutations.

Karim Beguir, InstaDeep’s Co-Founder and CEO, spoke about the partnership: “We believe these are the first results that clearly demonstrate the feasibility of developing foundation models in genomics that truly generalise across tasks. In many ways, these results mirror what we have seen in the development of adaptable foundation models in natural language processing over the last few years, and it’s incredibly exciting to see this now applied to such challenging problems in drug discovery and human health.”

Superior results indicate great potential

The largest model, a 2.5-billion-parameter LLM trained on a multi-species dataset, matched or outperformed specialised state-of-the-art models on 15 of 18 tasks. These results were achieved through parameter-efficient fine-tuning, but even feeding pre-trained embeddings from the transformer models into a simple downstream model, such as a shallow perceptron or logistic regression, resulted in equivalent or superior performance on 11 tasks.
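As an illustration of this probing approach, here is a minimal sketch using scikit-learn. The embedding array, its dimensions, and the labels are placeholder assumptions rather than the study’s actual data:

```python
# Minimal sketch of probing pre-trained embeddings with a simple classifier.
# Assumptions: `embeddings` is an (n_sequences, hidden_dim) array of pooled
# LLM representations and `labels` marks e.g. enhancer vs. non-enhancer
# sequences; both are random placeholders, not the study's data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 1280))  # placeholder embeddings
labels = rng.integers(0, 2, size=1000)      # placeholder binary labels

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")
```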

The team also found that intermediate layers in the LLM often produced representations that performed better on downstream tasks than those from the final layer. These findings show the potential for developing foundation models in genomics that generalise across tasks and have significant applications in drug discovery and human health.
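A sketch of how such a layer-by-layer comparison can be run, assuming per-layer sequence embeddings are already available (random placeholder data stands in for real representations, and the layer count is arbitrary):

```python
# Minimal sketch of a per-layer probe sweep. Assumption: `hidden_states`
# is a list of (n_sequences, hidden_dim) arrays, one per transformer layer;
# placeholder random data is used here instead of real embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_layers, n_seqs, dim = 12, 500, 256
hidden_states = [rng.normal(size=(n_seqs, dim)) for _ in range(n_layers)]
labels = rng.integers(0, 2, size=n_seqs)

# Fit an identical probe on each layer's representations and compare.
scores = []
for reps in hidden_states:
    probe = LogisticRegression(max_iter=1000)
    scores.append(cross_val_score(probe, reps, labels, cv=3).mean())

best = int(np.argmax(scores))
print(f"best layer: {best} (accuracy {scores[best]:.3f})")
```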

Key factors to improve performance

The researchers also explored the importance of sequence diversity and model scale in their study. They found that increasing either of these factors led to improved performance. For example, a 500-million-parameter model trained on only the human reference genome performed worse than the same model trained on the 1000 Genomes dataset (3,200 human genomes). Similarly, the 2.5-billion-parameter model trained on the 1000 Genomes dataset performed better than any 500-million-parameter model, but not as well as the same model trained on a custom multi-species dataset, even when downstream performance was measured on tasks concerning only the human genome.

An ongoing relationship

This announcement follows on from the 2022 news that InstaDeep had been granted access to Cambridge-1 alongside the five founding partners, enabling the company to accelerate the next wave of biology innovation, specifically to train AI language models using genomics data. 

An early draft of the results is available on bioRxiv, and full results will be described in a forthcoming publication. An advance “sneak peek” was presented at this week’s J.P. Morgan Healthcare Conference by NVIDIA Healthcare VP Kimberly Powell (Thursday, January 12 at 10:30 PST). Listen to the webcast and see the slides here.

Looking ahead, the team plans to explore further improvements to downstream task performance by fine-tuning the models directly, and will continue collaborating on architectural innovations for LLMs applied to genomics.


If you are interested in working on initiatives like this one, InstaDeep is hiring! Please check out all our open opportunities in BioAI – and other areas of the business – on our careers page: www.instadeep.com/careers