From our beginnings to becoming a key player in AI, supporting emerging research areas and sharing what we’ve learnt has always been at the core of InstaDeep’s DNA.
This year, our journey brings us to the Vancouver Convention Centre in Vancouver, Canada, for the 38th annual Neural Information Processing Systems (NeurIPS) conference, running from December 10 to 15, 2024. As one of the premier gatherings for machine learning and AI researchers from around the globe, NeurIPS provides a platform for exchanging ideas and showcasing innovation within the AI community.
With the conference about to begin, our team is gearing up to present its latest research: three main-track papers and five workshop papers spanning AI for the life sciences and decision-making. As we join the world’s leading minds in ML, this year’s contributions showcase our collaboration and innovation, particularly at the intersection of AI and biology.
Here’s a quick overview of what we’re bringing to NeurIPS 2024!
Main Track Papers
SPO: Sequential Monte Carlo Policy Optimisation
Sequential Monte Carlo Policy Optimisation (SPO) is a model-based RL algorithm that uses sample-based search for policy improvement, overcoming the scalability issues common to tree-based methods such as MCTS and AlphaZero. SPO casts policy improvement in the Expectation-Maximisation framework and delivers strong performance across both continuous and discrete tasks, outperforming its baselines. Its parallelisable design scales efficiently on hardware accelerators, giving faster training and inference while maintaining robust policy improvement.
- Read the full paper here: https://openreview.net/pdf?id=XKvYcPPH5G
- Slides/Talk: https://neurips.cc/virtual/2024/poster/94776
- Poster: #94776 | Wednesday 11 December 11:00 a.m. — 2:00 p.m. PST
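For readers who want a feel for the mechanics, here is a deliberately simplified, one-step Python sketch of the sample-then-reweight idea behind EM-style policy improvement. It is a toy, not the paper’s algorithm: SPO proper maintains particles over action sequences with sequential Monte Carlo, and every name below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def em_style_improvement(policy_logits, state, model, n_particles=64, temperature=1.0):
    # E-step: sample candidate actions ("particles") from the current policy
    probs = np.exp(policy_logits - policy_logits.max())
    probs /= probs.sum()
    actions = rng.choice(len(probs), size=n_particles, p=probs)
    # Weight each particle by its model-estimated return (softmax weighting)
    returns = np.array([model(state, a) for a in actions])
    weights = np.exp((returns - returns.max()) / temperature)
    weights /= weights.sum()
    # M-step: fit the improved policy to the reweighted particles
    target = np.zeros_like(probs)
    np.add.at(target, actions, weights)
    return target  # improved action distribution for this state

# Toy usage: a stand-in "model" whose estimated return peaks at action 3
toy_model = lambda state, action: -abs(action - 3)
print(em_style_improvement(np.zeros(5), state=None, model=toy_model))
```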
Multi-modal Transfer Learning between Biological Foundation Models
We introduce IsoFormer, a multi-modal model that bridges DNA, RNA, and protein data using pre-trained encoders to address complex genomics tasks. Applied to predict differential RNA transcript expression across human tissues, IsoFormer surpasses existing methods by effectively transferring knowledge across modalities. This approach opens new possibilities for multi-modal gene expression modeling in computational biology.
- Read the full paper here: https://arxiv.org/abs/2406.14150v1
- Code link: https://huggingface.co/InstaDeepAI/isoformer
- Poster: #93095 📍 Poster session 3 | Thursday 12 December 8:00 p.m. PST
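As a rough illustration of the multi-modal idea, the sketch below fuses embeddings from separate pre-trained DNA, RNA, and protein encoders through per-modality projections into a shared space before a prediction head. The dimensions, module names, and mean-pooling fusion are assumptions made for illustration; see the linked code for the actual architecture.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Toy fusion head over frozen, pre-trained modality encoders."""

    def __init__(self, dna_dim=1024, rna_dim=768, protein_dim=1280, d_model=512):
        super().__init__()
        # Project each modality's pre-trained embedding into a shared space
        self.proj = nn.ModuleDict({
            "dna": nn.Linear(dna_dim, d_model),
            "rna": nn.Linear(rna_dim, d_model),
            "protein": nn.Linear(protein_dim, d_model),
        })
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 1),  # e.g. a transcript expression level
        )

    def forward(self, embeddings: dict) -> torch.Tensor:
        # Average the projected modality embeddings, then predict
        shared = torch.stack([self.proj[k](v) for k, v in embeddings.items()])
        return self.head(shared.mean(dim=0))

model = MultiModalFusion()
out = model({"dna": torch.randn(2, 1024), "rna": torch.randn(2, 768),
             "protein": torch.randn(2, 1280)})  # -> shape (2, 1)
```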
Dispelling the Mirage of Progress in Offline MARL
Offline multi-agent reinforcement learning (MARL) shows great potential, but inconsistent baselines and evaluation protocols limit progress and make genuine advances hard to track. We identify key weaknesses in existing evaluation methodology and show that well-implemented, straightforward baselines can match or outperform state-of-the-art algorithms on 75% of the datasets we tested. To close these gaps, we provide a standardised evaluation protocol and baseline implementations, setting a clearer benchmark for future research in offline MARL.
- Read the full paper here: https://arxiv.org/abs/2406.09068
- Code link: https://github.com/instadeepai/og-marl
- Poster: #97812 | Wednesday 11 December 4:30 p.m. PST
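To give a flavour of the standardised reporting the paper argues for, here is a minimal sketch that aggregates normalised returns across independent training seeds with a 95% confidence interval. The numbers and helper names are illustrative; the actual protocol and baselines live in the linked repository.

```python
import numpy as np

def normalised_score(returns, random_score, expert_score):
    """Rescale raw returns so that 0 = random policy and 1 = expert policy."""
    return (np.asarray(returns) - random_score) / (expert_score - random_score)

def report(per_seed_returns, random_score, expert_score):
    # One mean episode return per independent training seed
    scores = normalised_score(per_seed_returns, random_score, expert_score)
    mean = scores.mean()
    ci95 = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
    return mean, ci95

# Illustrative numbers only
mean, ci = report([412.0, 398.5, 441.2, 405.7], random_score=50.0, expert_score=500.0)
print(f"normalised return: {mean:.2f} ± {ci:.2f}")
```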
Workshop Papers
Learning the Language of Protein Structure
Protein structures are hard to model because they are continuous and three-dimensional. To address this, we propose a vector-quantised autoencoder that converts protein structures into discrete tokens, turning the complex structure space into a manageable vocabulary. With codebooks of 4,096 to 64,000 tokens, our model achieves highly accurate reconstructions, with backbone RMSDs of 1-5 Å. We further show that a GPT model trained on these tokenised representations can generate novel, diverse, and designable protein structures. The approach not only captures protein structure effectively but also bridges gaps between data modalities, advancing computational methods in protein design.
- Read the full paper here: https://arxiv.org/abs/2405.15840
- Code link: https://github.com/instadeepai/protein-structure-tokenizer/
- Poster: #11 & 12 📍 East Meeting Rooms | Sunday 15 December
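The core quantisation step is simple to sketch: each residue embedding is snapped to its nearest codebook vector, and that vector’s index becomes the discrete token. The shapes and codebook size below are illustrative assumptions; the real tokenizer is in the linked repository.

```python
import torch

def quantise(z: torch.Tensor, codebook: torch.Tensor):
    """Map each residue embedding to the index of its nearest code vector.

    z:        (num_residues, d) continuous encoder outputs
    codebook: (codebook_size, d) learned code vectors
    """
    dists = torch.cdist(z, codebook)   # pairwise distances to all codes
    tokens = dists.argmin(dim=-1)      # one discrete token per residue
    z_q = codebook[tokens]             # quantised embeddings for the decoder
    return tokens, z_q

codebook = torch.randn(4096, 64)                         # 4,096-token codebook
tokens, z_q = quantise(torch.randn(128, 64), codebook)   # a 128-residue protein
```

The resulting token sequences are what the GPT model described above is trained on.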
BoostMD – Accelerating Molecular Dynamics with Machine-Learned Interatomic Potentials
Modelling atomic-scale processes in biology and materials science is often limited by slow simulation speeds. BoostMD, a new machine learning architecture, addresses this by reusing node features computed by a reference model in earlier steps to predict forces and energies, so that a smaller, faster model can run between reference-model evaluations. BoostMD runs up to eight times faster while maintaining accuracy on unseen dipeptides, making it a reliable tool for long-timescale molecular simulations.
- Workshop: Data-driven and Differentiable Simulations, Surrogates, and Solvers (D3S3) | Machine Learning for Structural Biology (MLSB)
- Read the full paper here: https://openreview.net/pdf?id=H0USH61HnF
- Posters:
  - D3S3 📍 Meeting Rooms 116-117 | Sunday 15 December
  - MLSB 📍 East Meeting Rooms 11 & 12 | Sunday 15 December
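The scheduling idea behind BoostMD can be sketched in a few lines: run the expensive reference model every k-th MD step, cache its node features, and let a lightweight boost model predict forces from those cached features in between. Every interface in this skeleton is an assumption for illustration, not the paper’s API.

```python
def boosted_md(positions, reference_model, boost_model, integrator, n_steps, k=8):
    """Skeleton of a BoostMD-style loop; all callables are illustrative."""
    node_features = None
    for step in range(n_steps):
        if step % k == 0:
            # Expensive step: evaluate the full reference MLIP and cache
            # the node features it computes along the way
            forces, node_features = reference_model(positions)
        else:
            # Cheap step: the boost model predicts forces from the current
            # positions plus the cached reference features
            forces = boost_model(positions, node_features)
        positions = integrator(positions, forces)
    return positions
```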
Metalic: Meta-Learning In-Context with Protein Large Language Models
We introduce Metalic, an in-context meta-learning approach for protein fitness prediction in extreme low-data settings. Metalic meta-trains over a distribution of related fitness-prediction tasks, learning how to use in-context sequences with protein language models (PLMs) and to generalise effectively to new tasks. Combined with fine-tuning at inference time, Metalic achieves strong performance on protein fitness prediction benchmarks, setting a new state of the art on ProteinGym with significantly fewer parameters than the baselines. Metalic can also exploit in-context learning on zero-shot tasks, further extending its applicability to settings with minimal labelled data.
- Workshop: Foundation Models for Science: Progress, Opportunities, and Challenges
- Read the full paper here: https://openreview.net/pdf?id=jQyFXpFmEP
- Code link: https://github.com/instadeepai/metalic
- Poster: #26 📍 West Meeting Room 202-204 | Sunday 15 December 3:00 — 3:30 p.m. PST
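As a purely hypothetical sketch of the in-context meta-learning loop: each meta-training step samples a fitness-prediction task, builds a context of labelled sequences, and trains the model to predict held-out fitness values from that context. The objects and function signatures below are assumptions, not the paper’s API; see the linked code for the real thing.

```python
import torch
import torch.nn.functional as F

def meta_train_step(model, task_sampler, optimiser, context_size=32, query_size=8):
    task = task_sampler()                              # one fitness landscape
    ctx_seqs, ctx_fitness = task.sample(context_size)  # in-context examples
    query_seqs, query_fitness = task.sample(query_size)
    # The model conditions on the labelled context when scoring the queries
    preds = model(query_seqs, context=(ctx_seqs, ctx_fitness))
    loss = F.mse_loss(preds, query_fitness)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```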
Bayesian Optimisation for Protein Sequence Design: Back to Basics with Gaussian Process Surrogates
Pre-trained protein language models (PLMs) have become ubiquitous in protein sequence design. However, Bayesian optimisation (BO) methods require a measure of uncertainty to explore the design space efficiently. This paper goes back to basics and explores Gaussian processes (GPs) as surrogate models for BO: they provide a principled measure of uncertainty and can efficiently incorporate new experimental data via Bayesian updates. We demonstrate empirically that GPs with string and fingerprint kernels are competitive with PLMs on the multi-round sequence design benchmark ProteinGym while requiring fewer computational resources, making them a competitive and resource-efficient alternative in low-data (low-N) settings.
- Workshop: AI for Accelerated Materials Design
- Read the full paper here: https://openreview.net/pdf?id=zori4pHmFP
- Poster: #211-214 📍 West Meeting Room | Saturday 14 December 8:15 a.m. — 5:30 p.m. PST
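For a feel of the approach, here is one round of GP-based Bayesian optimisation with scikit-learn, using one-hot features and an RBF kernel as stand-ins for the paper’s string and fingerprint kernels, and an upper-confidence-bound acquisition rule. Fixed-length sequences and all helper names are assumptions of this sketch.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def one_hot(seq: str) -> np.ndarray:
    x = np.zeros((len(seq), len(AA)))
    for i, a in enumerate(seq):
        x[i, AA.index(a)] = 1.0
    return x.ravel()

def bo_round(labelled: dict, candidates: list, beta: float = 2.0) -> str:
    """Pick the next sequence to assay via an upper-confidence-bound rule."""
    X = np.stack([one_hot(s) for s in labelled])
    y = np.array(list(labelled.values()))
    gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(np.stack([one_hot(s) for s in candidates]),
                           return_std=True)
    return candidates[int(np.argmax(mu + beta * sigma))]

next_seq = bo_round({"ACDE": 0.1, "ACDF": 0.5, "ACDY": 0.3},
                    candidates=["ACDK", "ACDW", "ACDV"])
```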
Bayesian Optimisation for Protein Sequence Design: Gaussian Processes with Zero-Shot Protein Language Model Prior Mean
In follow-up work, we demonstrate improved performance by augmenting the GP surrogate with strong zero-shot PLM predictions as a prior mean function. We explore methods for learning a linear combination of the zero-shot PLM and a constant mean function, so that the GP surrogate can regulate the influence of the PLM-guided prior.
- Workshop: Machine Learning for Structural Biology
- Poster: #11 & 12 📍 East Meeting Rooms | Sunday 15 December 8:15 a.m. — 5:30 p.m. PST
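One simple way to realise such a prior mean, sketched under our own assumptions rather than the paper’s exact formulation: fit a linear map from the zero-shot PLM score to the labels, model the residuals with a GP, and add the learned mean back at prediction time.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def fit_with_plm_prior(X, y, plm_scores):
    """GP regression with a learned linear-in-PLM-score prior mean."""
    # Learn a linear combination of the zero-shot PLM score and a constant
    a, b = np.polyfit(plm_scores, y, deg=1)
    gp = GaussianProcessRegressor(kernel=RBF())
    gp.fit(X, y - (a * plm_scores + b))  # the GP models the residuals

    def predict(X_new, plm_new):
        mu, sigma = gp.predict(X_new, return_std=True)
        return mu + a * plm_new + b, sigma  # restore the prior mean
    return predict
```

Letting the data determine the coefficient on the PLM score is what allows the surrogate to down-weight the prior when it disagrees with the observed fitness measurements.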
>> Follow us on X for more #NeurIPS2024 updates, and if you’re ready to be part of our AI research journey, join us at instadeep.com/careers.