Breaking the Performance Ceiling in Reinforcement Learning


Reinforcement learning (RL) has delivered some of AI’s most striking successes, from human-level Atari play [1] to world-class performance in Go [2]. Yet when applied to messy, real-world combinatorial optimisation (CO) problems such as energy grid management or autonomous logistics, even state-of-the-art RL systems can stall.

Despite being trained to convergence, policies often hit a performance ceiling: their zero-shot performance plateaus, and no amount of additional data or compute pushes it further. This suggests that progress is not simply a matter of training harder or longer.

The prevailing focus on zero-shot performance becomes increasingly hard to justify as task complexity grows, because the gap between zero-shot behaviour and true optimality widens dramatically. In many cases, earlier evaluations mask this issue by relying on benchmarks where models already achieve over 95% zero-shot performance.

As a result, practitioners often spend months chasing marginal gains while overlooking an equally fundamental paradigm: inference-time search. Our work consolidates the largest inference study to date, comprising over 60,000 experiments, and demonstrates its potential to unlock significant improvements, showing that inference-time methods are not just optional refinements but core drivers of performance.

Inference strategies put to the test

Instead of relying on a single, instinctive action, what if AI could explore multiple candidate solutions before committing to the best one? This is inference-time search: a lightweight yet powerful way to improve decision-making during execution by giving AI time to think.

Real-world systems don’t always need to act instantly, and many applications allow a few seconds, hours, or even days of inference-time computation. We explored this concept through Decentralised Partially Observable Markov Decision Processes (Dec-POMDPs), a framework that captures the complexity of real-world CO problems, by modelling multi-agent systems in which each agent has only partial observability. We benchmarked three base RL architectures (IPPO, MAPPO and Sable) combined with four inference strategies designed to make the most of this additional thinking time, across 17 complex RL tasks drawn from environments including Connector, StarCraft Multi-Agent Challenge V2, and Robotic Warehouse.

Figure 1: Overview of our tasks and experimental study. Source: internal

Stochastic Sampling
The simplest of the four, stochastic sampling improves solution quality by generating multiple options rather than always selecting the most likely one. It is easy to implement, robust, and surprisingly effective under small time budgets. However, because it relies on chance, its benefits plateau as compute scales, making it less effective for larger or more complex tasks.
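The idea can be sketched in a few lines: sample several complete trajectories from the stochastic policy and keep the best one. In this sketch, `policy` and `env_step` are hypothetical stand-ins for a trained policy and an environment transition function, not code from our study.

```python
import random

def rollout(policy, env_step, horizon, rng):
    """Sample one full trajectory by drawing actions from the policy's
    distribution instead of always taking the most likely action."""
    state, total_reward = 0, 0.0
    for _ in range(horizon):
        probs = policy(state)  # maps each action to its probability
        actions = list(probs)
        action = rng.choices(actions, weights=[probs[a] for a in actions])[0]
        state, reward = env_step(state, action)
        total_reward += reward
    return total_reward

def best_of_n(policy, env_step, horizon, n, seed=0):
    """Stochastic sampling: draw n candidate trajectories, keep the best."""
    rng = random.Random(seed)
    return max(rollout(policy, env_step, horizon, rng) for _ in range(n))
```

Because the first of the `n` samples is an ordinary greedy-budget rollout, best-of-n can never do worse than a single sample under the same seed; its returns simply stop improving once extra samples rarely beat the incumbent.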

Tree Search
Where stochastic sampling relies on randomness, tree search takes a more systematic route. It explores the most promising areas of the solution space by expanding candidate actions, simulating outcomes, and discarding weaker options. We tested a version known as Simulation-Guided Beam Search (SGBS), which improves on traditional methods by striking a better balance between speed and accuracy. It performs strongly when compute is abundant but time is short, although its step-by-step, greedy nature limits exploration when more time is available.
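The general pattern can be illustrated with a toy beam search in which every partial solution is ranked by a cheap rollout that completes it to the horizon. Here `simulate` stands in for SGBS's policy-guided simulation; this is a sketch of the pattern, not InstaDeep's implementation.

```python
def sgbs_sketch(actions, simulate, horizon, beam_width):
    """Toy simulation-guided beam search over fixed-length action sequences.

    `simulate(prefix)` must return the final return obtained by completing
    the partial sequence `prefix` with a cheap greedy rollout."""
    beam = [()]  # partial action sequences, best first
    for _ in range(horizon):
        # Expand every partial sequence in the beam with every action.
        children = [prefix + (a,) for prefix in beam for a in actions]
        # Rank candidates by their simulated final return rather than the
        # prefix score alone (the "simulation-guided" part of SGBS).
        children.sort(key=simulate, reverse=True)
        beam = children[:beam_width]  # keep only the most promising ones
    return beam[0]  # best full-length sequence found
```

The greediness shows up in the last line of the loop: anything outside the top `beam_width` candidates is discarded forever, which is why extra time buys SGBS less exploration than the other strategies.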

Online Fine-Tuning
While tree search focuses on exploration, online fine-tuning instead adapts the model itself. It updates a pre-trained policy during inference, tailoring it to the specific problem instance being solved. This approach can lead to impressive results under large time or compute budgets. However, it can be unpredictable in practice: additional compute reduces the number of attempts, updates can become unstable, and the model may settle on suboptimal solutions. As a result, fine-tuning tends to require substantial compute resources to consistently outperform simpler methods.
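A minimal illustration of the idea, using a REINFORCE-style update on a toy single-instance problem. The reward function, learning rate, and running baseline here are illustrative assumptions, not our training setup.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def online_finetune(reward_fn, n_actions, steps, lr=0.5, seed=0):
    """REINFORCE-style adaptation of a (toy) policy to one problem instance
    at inference time: sample an action, observe its reward, and nudge the
    logits toward actions that beat the running baseline."""
    rng = random.Random(seed)
    logits = [0.0] * n_actions  # stand-in for pre-trained policy weights
    baseline, best = 0.0, float("-inf")
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choices(range(n_actions), weights=probs)[0]
        r = reward_fn(a)
        best = max(best, r)
        advantage = r - baseline
        baseline += 0.1 * (r - baseline)  # exponential running baseline
        for i in range(n_actions):        # policy-gradient step on logits
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return best, logits
```

The sketch also hints at the instability noted above: each gradient step spends budget that could have been another attempt, and a poorly tuned learning rate can collapse the policy onto a suboptimal action.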

Diversity-Based Search (COMPASS)
Finally, diversity-based search takes a different perspective altogether. Rather than adjusting a single model or exploring one path at a time, it works by pre-training a collection of diverse, specialised policies and, at inference time, using the available compute and time budget to identify the one best suited to the current task.

The leading example is InstaDeep’s COMPASS (Combinatorial Optimisation with Policy Adaptation using Latent Space Search). COMPASS conditions policies on a latent representation that captures a wide range of behaviours, then efficiently searches this space to find the optimal configuration. All agents share the same latent space, keeping the method efficient in multi-agent settings.
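As a rough illustration of searching a latent behaviour space, the sketch below uses a simple evolution-strategy loop. `evaluate` abstracts "roll out the policy conditioned on latent vector z on the current instance", and the search loop itself is a generic stand-in, not COMPASS's actual algorithm.

```python
import random

def latent_search(evaluate, dim, budget, pop=8, sigma=0.5, seed=0):
    """Search a latent space of pre-trained behaviours: sample candidate
    latent vectors, score each with `evaluate`, and re-centre the sampling
    distribution on the best vector found so far."""
    rng = random.Random(seed)
    centre = [0.0] * dim
    best_z, best_score = centre, evaluate(centre)
    for _ in range(budget // pop):
        for _ in range(pop):
            z = [c + rng.gauss(0.0, sigma) for c in centre]
            score = evaluate(z)
            if score > best_score:
                best_z, best_score = z, score
        centre = best_z   # focus sampling around the best latent found
        sigma *= 0.9      # gradually narrow the search
    return best_z, best_score
```

Note that each candidate `z` can be evaluated independently, which is what makes this family of methods scale so smoothly with parallel compute.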

Figure 2: Numerous applications of RL involve two distinct phases: (1) a training phase, typically unconstrained in time and compute, during which a policy is optimised over a representative distribution of problem instances; and (2) an inference phase, where a limited time and compute budget are allocated to solving a new instance. The inference phase is often overlooked, despite its crucial role in complex tasks where partial observability and the combinatorial growth of observation and action spaces make good solutions unattainable through zero-shot execution alone. Source: internal

Across all evaluations, COMPASS consistently emerged as the strongest inference strategy. It scales smoothly with both compute and time, exploits diversity for parallel exploration, and continues to improve until near-perfect results are reached. In practice, it pushed Sable from around 60% performance to over 95% on some of the hardest tasks.

Results
Across all 17 challenging RL environments, inference-time search delivered clear and consistent performance gains. On average, models achieved a 45% improvement over zero-shot state-of-the-art, and in the most difficult tasks, performance increased by up to 126%. Remarkably, these results were achieved with only 30 seconds of additional time.

Beyond the numbers, these findings highlight a shift in how progress in reinforcement learning can be achieved. For years, most effort has focused on pushing the limits of training: more data, more compute, more parameters. This study shows that major improvements can also be realised after training, through inference-time strategies that make better use of a model’s existing knowledge.

Figure 3: Improvement from using inference-time search over zero-shot state-of-the-art. Across 17 complex reinforcement learning tasks, we obtain consistent and significant performance gains using only a 30 second search budget during execution. Source: internal

What’s next

These findings redefine how we think about progress in RL. Rather than endlessly scaling training, smarter deployment-time strategies can unlock dramatic improvements with minimal cost.

By treating inference as a structured exploration and reasoning process instead of a single forward pass, RL systems can bridge the gap between learned behaviour and optimal performance. Inference-time strategies, once considered optional, now stand out as integral components of modern decision-making pipelines.

Crucially, inference-time search allows us to move beyond accepting outcomes that are merely “good enough.” With only a small increase in inference time, we can prioritise excellent solutions, achieving higher-quality results without retraining and with limited added complexity.

The implications are far-reaching. Whether managing power grids, coordinating autonomous fleets, or designing printed circuit boards, giving AI a brief moment to think may prove more valuable than weeks of additional training.

🚀 We’re proud to share that this research has been selected among the top 0.3% of applicants for an oral presentation at NeurIPS 2025!

All results, including the more than 60,000 experiments across 17 complex reinforcement learning tasks, are publicly available. Read the full paper and explore the accompanying data and code.


Disclaimer: All claims made are supported by our research paper, Breaking the Performance Ceiling in Reinforcement Learning requires Inference Strategies, unless explicitly cited otherwise.

1. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
2. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.