Oryx: InstaDeep’s scalable sequence model for multi-agent coordination in offline settings

Multi-agent reinforcement learning (MARL) holds significant promise across domains such as autonomous driving, warehouse logistics, intelligent rail networks, and satellite alignment. Yet deploying MARL in the real world remains difficult. Training typically requires vast amounts of interactive data, which is both costly and potentially risky, particularly in safety-critical settings where trial and error is not an option.

Fortunately, many of these domains already generate large static datasets, from historical traffic records to robot navigation trajectories and train schedules. Offline MARL aims to unlock this potential by training multi-agent policies solely from pre-collected data, but this presents its own set of challenges. Extrapolation error arises when agents attempt actions outside the dataset distribution, an issue that compounds as the joint action space grows with the number of agents. Miscoordination also becomes a risk: without active interaction, agents may learn incompatible strategies from suboptimal logged behaviour, undermining cooperation.

Existing methods have made progress on these issues, but most fail to scale beyond relatively small agent groups or when temporal memory is required. This highlights the need for a truly scalable alternative. Oryx is InstaDeep’s answer to this challenge: a new sequential model for offline MARL designed to achieve effective, multi-step coordination among many agents in complex environments.

Introducing Oryx

Oryx adapts InstaDeep’s Sable retention-based architecture, which provides the sequential backbone of the model, and integrates it with the loss function from implicit constraint Q-learning (ICQ)¹. By combining Sable’s autoregressive structure with ICQ’s objective, Oryx introduces a sequential variant of ICQ that enables stable, offline policy updates across long trajectories.
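The core idea behind an ICQ-style objective can be sketched in a few lines: the policy is updated only on actions that actually appear in the dataset, weighted by a softmax over their estimated advantages. The snippet below is a minimal illustration of that weighting, not Oryx's actual implementation; the function names and the `beta` temperature parameter are our own choices.

```python
import numpy as np

def icq_policy_weights(q_values, beta=1.0):
    """Softmax weights over dataset actions by estimated value.

    Higher-valued logged actions get larger weight, so the policy is
    improved without ever evaluating out-of-dataset actions -- the
    implicit constraint that limits extrapolation error.
    """
    adv = q_values - q_values.max()   # subtract max for numerical stability
    w = np.exp(adv / beta)
    return w / w.sum()

def icq_policy_loss(log_probs, q_values, beta=1.0):
    """Advantage-weighted negative log-likelihood of dataset actions."""
    w = icq_policy_weights(q_values, beta)
    return -(w * log_probs).sum()
```

Because the weights are a softmax rather than a hard filter, low-value logged actions are down-weighted smoothly instead of being discarded outright.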

Put simply, Oryx equips agents with memory to capture long-horizon dependencies, restraint to avoid unseen actions, and autoregressive policies to coordinate decisions one after another. Its dual-decoder architecture outputs both policy predictions and state-action value estimates simultaneously, enabling Oryx to evaluate the relative benefit of alternative actions. This counterfactual reasoning underpins stable, extrapolation-safe updates during training.
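A dual-head decoder of this kind can be illustrated with a toy module: a shared representation feeds two linear heads, one producing per-action Q-values and the other a policy distribution. This is a deliberately simplified sketch; the class name, layer shapes, and initialisation are illustrative stand-ins for Oryx's actual retention-based decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

class DualHeadDecoder:
    """Toy dual-head decoder: one shared input representation feeds
    two linear heads -- per-action Q-values and policy logits."""

    def __init__(self, d_model, n_actions):
        # Small random weights stand in for trained parameters.
        self.w_q = rng.normal(size=(d_model, n_actions)) * 0.1
        self.w_pi = rng.normal(size=(d_model, n_actions)) * 0.1

    def __call__(self, h):
        q_values = h @ self.w_q                      # (batch, n_actions)
        logits = h @ self.w_pi
        logits = logits - logits.max(axis=-1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=-1, keepdims=True)   # policy distribution
        return q_values, probs
```

Producing both outputs from the same representation is what lets the model compare the value of the action the policy would take against the logged alternatives.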

Crucially, each agent selects its action in sequence, conditioned on the choices already made by others. This sequential mechanism stabilises learning and enforces coordination across long horizons and large groups.
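The sequential selection loop can be sketched as follows. Here `policy_fn` is a hypothetical stand-in for Oryx's decoder: given one agent's encoded observation and the running list of actions already chosen by earlier agents, it returns that agent's action distribution.

```python
import numpy as np

def select_actions_autoregressively(policy_fn, encoded_obs):
    """Pick actions agent by agent, each conditioned on earlier choices.

    `encoded_obs` is an ordered sequence of per-agent observations;
    `policy_fn(obs_i, prev_actions)` returns a probability vector over
    that agent's actions. Both are illustrative, not Oryx's API.
    """
    actions = []
    for obs_i in encoded_obs:                  # fixed agent ordering
        probs = policy_fn(obs_i, actions)
        actions.append(int(np.argmax(probs)))  # greedy pick for the sketch
    return actions
```

Because each agent observes the committed choices of those before it, incompatible joint actions are pruned as the sequence unfolds, rather than discovered after the fact.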

Figure 1: Oryx’s model architecture. The green blocks indicate the inputs to the model (in yellow), sourced from the dataset of online experiences (in blue). First, a sequence of agent observations from timestep t to t+k is passed through the encoder. Inside each retention block, the network performs joint reasoning over the agents (a1, . . . , an) and temporal context (t, . . . , t + k), producing encoded representations at each timestep. These encoded observations, along with the actions from the dataset, are passed to the decoder, which has two heads. One head returns Q-values, while the second returns a policy distribution for each agent for the full sequence.

Results

Oryx was evaluated across a broad suite of benchmarks spanning both discrete and continuous control, with tasks varying in scale and difficulty. Across all domains, the algorithm achieved state-of-the-art performance on more than 80% of 65 datasets, demonstrating strong generalisation in settings characterised by long horizons and high agent density.

In controlled environments such as T-Maze, which isolates the need for temporal memory and joint decision-making, Oryx consistently reached near-optimal performance. Baseline ICQ variants failed in these scenarios, and ablation studies confirmed that each of Oryx’s components was individually critical to success.

Performance advantages were even clearer in more demanding settings such as Connector, where coordination complexity grows rapidly with agent density. Oryx maintained near-expert performance with 30–50 agents, while competing approaches such as Multi-Agent ICQ degraded sharply under the same conditions.

Oryx also set new standards across established offline MARL benchmarks. In the StarCraft Multi-Agent Challenge (SMAC), a widely recognised environment featuring diverse cooperative, discrete-control tasks with dense reward signals and short episodes, Oryx outperformed previous state-of-the-art methods on 34 of 43 datasets spanning nine scenarios.

In Multi-Agent MuJoCo (MAMuJoCo), which evaluates continuous-control coordination across multiple configurations, Oryx matched or surpassed the best published performance on 14 of 16 datasets, highlighting its robustness across continuous domains.

Finally, in RWARE, a long-horizon, sparse-reward benchmark where agents must learn to cover a warehouse efficiently, Oryx achieved the highest reported scores across all six publicly available datasets, in some cases improving upon the previous state of the art by nearly 20%.

To encourage further progress, all newly created datasets, including those with up to 50 agents, are publicly available.

Figure 2: Evaluating Oryx on Connector with a varying number of agents. Oryx achieves stable performance even with 50 agents.

What’s next?

Oryx represents a significant step forward in offline MARL. By unifying retention-based sequence modelling with extrapolation-safe policy updates, it addresses the two central challenges of extrapolation error and miscoordination. With extensive benchmarks, newly released datasets, and demonstrated scalability to 50-agent coordination, Oryx provides a strong foundation for moving MARL beyond simulation toward impactful real-world applications.

🚀 We are thrilled to announce that Oryx will be presented at NeurIPS 2025, one of the world’s leading conferences in machine learning. Dive into the research and explore the model on GitHub.


Disclaimer: All claims made are supported by our research paper, Oryx: a Scalable Sequence Model for Many-Agent Coordination in Offline MARL, unless explicitly cited otherwise.

1 Yang, X., Ma, C., Li, C., Zheng, Z., Zhang, Q., Huang, G., Yang, J., & Zhao, Q. (2021). Believe what you see: Implicit constraint approach for offline multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, 34. Curran Associates, Inc. Retrieved from https://proceedings.neurips.cc/paper_files/paper/2021/file/550a141f12de6341fba65b0ad0433500-Paper.pdf  ↩︎