InstaDeep showcases 8 papers at NeurIPS 2024

From our beginnings to becoming a key player in AI, supporting new research and sharing what we’ve learnt have always been at the core of InstaDeep’s DNA.

This year, our journey brings us to the Vancouver Convention Center in Canada for the 38th annual Neural Information Processing Systems (NeurIPS) conference, December 10–15, 2024. As one of the premier gatherings for Machine Learning and AI research experts from around the globe, NeurIPS provides a platform for exchanging ideas and showcasing innovation within the AI community.

With the conference about to start, our team is gearing up to present its latest research: three main track papers and five workshop papers focused on AI for Life Sciences and Decision-Making. Joining the world’s leading minds in ML, this year’s research showcases our collaboration and innovation, particularly at the intersection of AI and biology.

Here’s a quick overview of what we’re bringing to NeurIPS 2024!

Main Track Papers

SPO: Sequential Monte Carlo Policy Optimisation

Sequential Monte Carlo Policy Optimization (SPO) is a model-based RL algorithm that uses sample-based search for policy improvement, overcoming scalability issues common in tree-based methods such as MCTS and AlphaZero. SPO leverages the Expectation-Maximization framework to deliver strong performance across both continuous and discrete tasks, outperforming baselines. Its parallelizable design allows it to scale efficiently on hardware accelerators, ensuring faster training and inference while maintaining robust policy improvements.
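
For intuition, here is a minimal sketch of the EM-style, sample-based policy improvement this family of methods builds on; the toy reward, Gaussian policy, and hyperparameters are illustrative assumptions on our part, not the paper’s implementation:

```python
# Minimal sketch of EM-style, sample-based policy improvement in the spirit
# of SPO. The toy reward and Gaussian policy are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def reward(a: np.ndarray) -> np.ndarray:
    """Toy stand-in for a learned model's return estimate."""
    return -np.sum((a - 0.5) ** 2, axis=-1)

mu, sigma = np.zeros(2), np.ones(2)   # Gaussian policy parameters
temperature = 0.1

for _ in range(50):
    # E-step: evaluate a batch of sampled actions in parallel
    # (this is the part that maps well onto hardware accelerators).
    actions = rng.normal(mu, sigma, size=(256, 2))
    r = reward(actions)
    weights = np.exp((r - r.max()) / temperature)
    weights /= weights.sum()
    # M-step: refit the policy to the return-weighted samples.
    mu = weights @ actions
    sigma = np.sqrt(weights @ (actions - mu) ** 2) + 1e-3

print(mu)  # converges towards the optimum near [0.5, 0.5]
```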

Multi-modal Transfer Learning between Biological Foundation Models 

We introduce IsoFormer, a multi-modal model that bridges DNA, RNA, and protein data using pre-trained encoders to address complex genomics tasks. Applied to predict differential RNA transcript expression across human tissues, IsoFormer surpasses existing methods by effectively transferring knowledge across modalities. This approach opens new possibilities for multi-modal gene expression modeling in computational biology.
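
As a rough picture of the multi-modal setup, here is a hedged sketch in which pooled embeddings from pre-trained encoders are concatenated and fed to a per-tissue prediction head; the encoder stubs, dimensions, and names are our assumptions, not IsoFormer’s actual architecture:

```python
# Hedged sketch of the multi-modal fusion idea: pooled embeddings from
# pre-trained DNA/RNA/protein encoders feed a per-tissue prediction head.
# Encoder stubs, dimensions, and names are illustrative assumptions.
import torch
import torch.nn as nn

class MultiModalRegressor(nn.Module):
    def __init__(self, dna_enc, rna_enc, prot_enc, d_model=256, n_tissues=30):
        super().__init__()
        self.encoders = nn.ModuleList([dna_enc, rna_enc, prot_enc])
        self.head = nn.Sequential(
            nn.Linear(3 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, n_tissues),     # per-tissue expression
        )

    def forward(self, dna, rna, prot):
        embs = [enc(x).mean(dim=1)             # mean-pool token embeddings
                for enc, x in zip(self.encoders, (dna, rna, prot))]
        return self.head(torch.cat(embs, dim=-1))

# Stand-ins for pre-trained encoders mapping token ids to embeddings.
enc = lambda: nn.Embedding(8, 256)
model = MultiModalRegressor(enc(), enc(), enc())
x = torch.randint(0, 8, (4, 128))              # batch of 4 token sequences
print(model(x, x, x).shape)                    # torch.Size([4, 30])
```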

Dispelling the Mirage of Progress in Offline MARL

Offline multi-agent reinforcement learning (MARL) shows great potential, but inconsistencies in current baselines and evaluation protocols limit progress and make it hard to track real advancements. We identify key weaknesses in existing evaluation methods and show that well-implemented, straightforward baselines can often match or outperform state-of-the-art algorithms across 75% of tested datasets. To address these gaps, we provide a standardized evaluation method and baseline implementations, setting a clearer benchmark for future research in offline MARL.
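
To make the evaluation point concrete, here is a minimal sketch of the kind of standardized protocol the paper argues for: many evaluation episodes per dataset and aggregate statistics with uncertainty, rather than a single run. The `env.rollout` interface is a hypothetical stand-in:

```python
# Minimal sketch of a standardized evaluation protocol of the kind the paper
# argues for: fixed seeds, many evaluation episodes, and aggregate statistics
# with uncertainty. `env.rollout` is a hypothetical stand-in interface.
import numpy as np

def evaluate(policy, env, n_episodes=32, seed=0):
    rng = np.random.default_rng(seed)
    returns = np.array([
        env.rollout(policy, seed=int(rng.integers(1 << 31)))
        for _ in range(n_episodes)
    ])
    # Report the mean with a 95% bootstrap confidence interval,
    # rather than a single cherry-picked run.
    boots = rng.choice(returns, size=(1000, n_episodes)).mean(axis=1)
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return returns.mean(), (lo, hi)
```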

Workshop Papers

Learning the Language of Protein Structures

Protein structure modeling is challenging due to its continuous and three-dimensional nature. To address this, we propose a vector-quantized autoencoder that converts protein structures into discrete tokens, simplifying the complex structure space into a discrete, manageable format. Using a codebook of 4,096 to 64,000 tokens, our model achieves highly accurate reconstructions, with backbone RMSDs of 1-5 Å. Additionally, we show that a GPT model trained on these tokenized representations can generate novel, diverse, and designable protein structures. This approach not only captures protein structures effectively but also bridges gaps between different data modalities, advancing computational methods in protein design.
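
The vector-quantization step itself is simple to sketch: each continuous structure embedding is snapped to its nearest codebook entry, and that entry’s index becomes the discrete token. The dimensions below are illustrative (the 4,096-entry codebook matches the lower end of the range quoted above):

```python
# Sketch of the vector-quantization bottleneck: each continuous embedding is
# mapped to its nearest codebook entry; the entry index is the discrete token.
# Dimensions are illustrative.
import torch

def quantize(z: torch.Tensor, codebook: torch.Tensor):
    """z: (n_residues, d) encoder outputs; codebook: (n_codes, d)."""
    dists = torch.cdist(z, codebook)           # (n_residues, n_codes)
    tokens = dists.argmin(dim=-1)              # discrete token ids
    return tokens, codebook[tokens]            # ids + quantized embeddings

codebook = torch.randn(4096, 64)               # 4,096-entry codebook
z = torch.randn(128, 64)                       # one structure's embeddings
tokens, z_q = quantize(z, codebook)
print(tokens.shape, z_q.shape)                 # (128,), (128, 64)
```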

BoostMD – Accelerating Molecular Dynamics with Machine-Learned Interatomic Potentials (MLIPs)

Modeling atomic-scale processes in biology and materials science is often limited by slow simulation speeds. BoostMD, a new machine learning architecture, addresses this by reusing node features from prior reference-model steps to predict forces and energies, allowing a smaller, faster model to run between reference model evaluations (a minimal sketch of this alternation follows the workshop details below). BoostMD achieves up to 8x faster simulations while maintaining accuracy on unseen dipeptides, making it a reliable tool for long-timescale molecular simulations.

  • Workshop: Data-driven and Differentiable Simulations, Surrogates, and Solvers (D3S3) | Machine Learning for Structural Biology (MLSB)
  • Read the full paper here: https://openreview.net/pdf?id=H0USH61HnF 
  • Poster: 
    • D3S3 📍 Meeting 116-117 | Sunday 15 December
    • MLSB 📍 MTG Rooms 11 & 12 | Sunday 15 December
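
Here is the minimal sketch promised above of the reference/booster alternation: the expensive reference model runs only every k-th step, and a cheap booster model reuses its latest node features in between. All callables and the naive integrator are illustrative stand-ins, not the paper’s implementation:

```python
# Hedged sketch of the reference/booster alternation: the expensive reference
# MLIP runs every k-th step; a cheap booster reuses its latest node features
# in between. `reference`, `booster`, and the Euler integrator are stand-ins.
import numpy as np

def md_loop(pos, vel, reference, booster, n_steps=1000, k=8, dt=1e-3):
    feats = None
    for step in range(n_steps):
        if step % k == 0:
            feats, forces = reference(pos)     # slow, accurate model
        else:
            forces = booster(pos, feats)       # fast model reusing features
        vel = vel + dt * forces                # simplistic integrator
        pos = pos + dt * vel
    return pos
```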

Metalic: Meta-Learning In-Context with Protein Large Language Models 

We introduce Metalic, an in-context meta-learning approach for protein fitness prediction in extreme low-data settings. Critically, Metalic leverages a meta-training phase over a distribution of related fitness prediction tasks to learn how to use in-context sequences with protein language models (PLMs) and generalize effectively to new fitness prediction tasks. Combined with fine-tuning at inference time, Metalic achieves strong performance on protein fitness prediction benchmarks, setting a new state of the art on ProteinGym with significantly fewer parameters than baselines. Importantly, Metalic can exploit in-context learning for zero-shot tasks, further enhancing its applicability to scenarios with minimal labeled data.
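
Schematically, a meta-training step looks something like the sketch below: sample a fitness-prediction task, place labelled support sequences in the model’s context, and take the loss on held-out queries. The model interface and task sampler are hypothetical stand-ins, not Metalic’s actual code:

```python
# Minimal sketch of an in-context meta-training step: sample one fitness
# landscape, condition the model on labelled support pairs, and score it on
# held-out queries. `model` and `tasks` are hypothetical stand-ins.
import torch
import torch.nn.functional as F

def meta_train_step(model, tasks, optimizer, n_support=8):
    task = tasks.sample()                      # one fitness-prediction task
    seqs, fitness = task.sequences, task.labels
    support = (seqs[:n_support], fitness[:n_support])
    queries, targets = seqs[n_support:], fitness[n_support:]
    # The PLM-based model attends over in-context (sequence, fitness) pairs.
    preds = model(context=support, inputs=queries)
    loss = F.mse_loss(preds, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```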

Bayesian Optimisation for Protein Sequence Design: Back to Basics with Gaussian Process Surrogates 

Pre-trained protein language models (PLMs) have become synonymous with protein sequence design. However, Bayesian optimization (BO) methods require a measure of uncertainty to efficiently explore the space of designs. This paper goes back to basics and explores Gaussian processes (GPs) as surrogate models for BO: they provide a principled measure of uncertainty and can efficiently incorporate new experimental data via Bayesian updates. We empirically demonstrate that GPs with string and fingerprint kernels are competitive with PLMs on the multi-round sequence design benchmark ProteinGym while requiring fewer computational resources, making them a strong, resource-efficient alternative in low-data (low-n) settings.
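
For a feel of how back-to-basics this can be, here is a self-contained GP posterior over sequences using a simple 2-mer count fingerprint and a linear kernel; the paper’s string and fingerprint kernels are more sophisticated, so treat this purely as an illustrative sketch:

```python
# Back-to-basics sketch: a GP posterior over protein sequences using a simple
# 2-mer count fingerprint and a linear kernel. Purely illustrative; the
# paper's string and fingerprint kernels are more sophisticated.
import itertools
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"
KMERS = ["".join(p) for p in itertools.product(AAS, repeat=2)]

def fingerprint(seq: str) -> np.ndarray:
    """Normalized 2-mer count vector for an amino-acid sequence."""
    v = np.array([seq.count(k) for k in KMERS], dtype=float)
    return v / max(np.linalg.norm(v), 1e-8)

def gp_posterior(train_seqs, y, test_seqs, noise=1e-2):
    X = np.stack([fingerprint(s) for s in train_seqs])
    Xs = np.stack([fingerprint(s) for s in test_seqs])
    K = X @ X.T + noise * np.eye(len(X))        # linear kernel on fingerprints
    Ks = Xs @ X.T
    mean = Ks @ np.linalg.solve(K, y)           # posterior mean
    v = np.linalg.solve(K, Ks.T)
    var = np.sum(Xs * Xs, axis=1) - np.sum(Ks * v.T, axis=1)
    return mean, var                            # feed into an acquisition fn
```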

Bayesian Optimisation for Protein Sequence Design: Gaussian Processes with Zero-Shot Protein Language Model Prior Mean

In follow-up work, we demonstrate improved performance by augmenting the GP surrogate with strong zero-shot PLM predictions as a prior mean function. We explore methods for learning a linear combination of the zero-shot PLM and a constant mean function, so that the GP surrogate can regulate the influence of the PLM-guided prior.
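
The mechanics of the prior mean are compact: the GP models residuals around the PLM score m(x), so the posterior mean becomes m(x*) + k*ᵀ(K + σ²I)⁻¹(y − m(X)). A minimal sketch, with the kernel matrices and PLM scores assumed precomputed:

```python
# Sketch of a GP whose prior mean is a zero-shot PLM score: the GP fits the
# residuals y - m(X) and adds the PLM prediction back at test time.
# Kernel matrices and PLM scores are assumed given (illustrative interface).
import numpy as np

def gp_posterior_mean(K, Ks, y, m_train, m_test, noise=1e-2):
    """K: (n,n) train kernel; Ks: (m,n) test/train kernel;
    m_train/m_test: zero-shot PLM scores used as the prior mean."""
    A = K + noise * np.eye(len(K))
    return m_test + Ks @ np.linalg.solve(A, y - m_train)

# In the follow-up work, the prior mean is a learned linear combination
# (roughly a * m_plm + c), letting the GP down-weight an unhelpful prior.
```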


>> Follow us on X for more #NeurIPS2024 updates, and if you’re ready to be part of our AI research journey, join us at instadeep.com/careers.