Sample-efficient and Scalable Exploration in Continuous-Time RL
COMBRL is a continuous-time model-based reinforcement learning algorithm for efficient exploration under unknown nonlinear dynamics. It combines task-reward maximization with exploration driven by epistemic uncertainty in a simple optimistic objective, yielding a scalable approach that supports both reward-driven and unsupervised settings.
Authors:
Klemens Iten,
Lenart Treven,
Bhavya Sukhija,
Florian Dörfler,
Andreas Krause
Venue:
ICLR 2026
Overview
Most reinforcement learning methods are designed for discrete-time dynamics, even though many real-world control systems are naturally continuous in time. In this work, we study continuous-time reinforcement learning, where the unknown system dynamics are modeled by nonlinear ordinary differential equations. We consider systems of the form
\[ \dot{\mathbf{x}}(t) = \mathbf{f}^*(\mathbf{x}(t), \mathbf{u}(t)), \qquad \mathbf{u}(t) = \pi(\mathbf{x}(t)), \]
with task objective (writing \(\mathbf{x}_t\) as shorthand for \(\mathbf{x}(t)\))
\[ J(\pi, \mathbf{f}^*) = \int_0^T r\big(\mathbf{x}_t, \pi(\mathbf{x}_t)\big)\,dt. \]
The goal is to learn a policy that both solves the task and collects informative data for learning an accurate dynamics model.
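To make the setup concrete, the following minimal sketch rolls out a closed-loop system \(\dot{\mathbf{x}} = \mathbf{f}^*(\mathbf{x}, \pi(\mathbf{x}))\) and approximates \(J(\pi, \mathbf{f}^*)\) numerically. The damped pendulum, the feedback policy, the quadratic reward, and all function names are illustrative assumptions, not taken from the paper or its code.

```python
import math

def f_star(x, u):
    """Hypothetical true dynamics: damped pendulum, x = (angle, angular velocity)."""
    theta, omega = x
    return (omega, -9.81 * math.sin(theta) - 0.1 * omega + u)

def policy(x):
    """Hypothetical state-feedback policy u = pi(x)."""
    theta, omega = x
    return -2.0 * theta - 1.0 * omega

def reward(x, u):
    """Hypothetical running reward: penalise distance from the origin and control effort."""
    theta, omega = x
    return -(theta ** 2 + 0.1 * omega ** 2 + 0.01 * u ** 2)

def rk4_step(f, pi, x, dt):
    """One classical Runge-Kutta step of the closed loop x' = f(x, pi(x))."""
    def g(z):
        return f(z, pi(z))
    k1 = g(x)
    k2 = g(tuple(xi + 0.5 * dt * ki for xi, ki in zip(x, k1)))
    k3 = g(tuple(xi + 0.5 * dt * ki for xi, ki in zip(x, k2)))
    k4 = g(tuple(xi + dt * ki for xi, ki in zip(x, k3)))
    return tuple(
        xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
    )

def objective(f, pi, x0, T=5.0, steps=500):
    """Approximate J(pi, f) = int_0^T r(x_t, pi(x_t)) dt with the trapezoidal rule."""
    dt = T / steps
    x, total = x0, 0.0
    r_prev = reward(x, pi(x))
    for _ in range(steps):
        x = rk4_step(f, pi, x, dt)
        r_next = reward(x, pi(x))
        total += 0.5 * dt * (r_prev + r_next)
        r_prev = r_next
    return total, x

J_value, x_T = objective(f_star, policy, x0=(1.0, 0.0))
```

In the learning problem, `f_star` is unknown and can only be queried by running the system, which is why the choice of policy also determines what data becomes available.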
Method
Continuous-time RL. We consider control systems that evolve continuously according to \(\dot{\mathbf{x}}(t) = \mathbf{f}^*(\mathbf{x}(t), \mathbf{u}(t))\), where a policy \(\pi\) specifies the control input \(\mathbf{u}(t) = \pi(\mathbf{x}(t))\). This formulation naturally captures physical systems such as robots or dynamical processes.
Model-based RL loop. COMBRL follows a model-based loop:
- fit an uncertainty-aware dynamics model \((\boldsymbol{\mu}_n, \boldsymbol{\sigma}_n)\) from data,
- plan a policy \(\pi_n\) using the model,
- execute \(\pi_n\) to collect a trajectory \(\tau_n\),
- update the model with new (noisy, irregularly sampled) measurements.
A key challenge in continuous time is that data is collected along trajectories at irregular time points, making both learning and exploration more difficult.
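The loop above can be sketched on a toy scalar system. Everything in this sketch is an assumption made for illustration: the linear true dynamics, the bootstrap ensemble of least-squares fits standing in for the uncertainty-aware model, the noisy derivative measurements, and the small finite policy class. Planning here is greedy on the mean model only; it is not the COMBRL objective.

```python
import random

random.seed(0)
GOAL, T, STEPS = 1.0, 3.0, 300
DT = T / STEPS
GAINS = [0.5, 1.0, 2.0, 4.0]  # hypothetical policy class: u = k * (GOAL - x)

def f_star(x, u):
    """Hypothetical true dynamics, unknown to the learner."""
    return -0.5 * x + u

def rollout(f, k, x0=0.0):
    """Euler rollout of x' = f(x, u) under u = k * (GOAL - x)."""
    xs, us, ret, x = [], [], 0.0, x0
    for _ in range(STEPS):
        u = k * (GOAL - x)
        xs.append(x)
        us.append(u)
        ret -= (x - GOAL) ** 2 * DT
        x += DT * f(x, u)
    return xs, us, ret

def fit_linear(data):
    """Least-squares fit of dx/dt ~ a*x + b*u via the (lightly regularised) normal equations."""
    sxx = sum(x * x for x, u, y in data) + 1e-6
    suu = sum(u * u for x, u, y in data) + 1e-6
    sxu = sum(x * u for x, u, y in data)
    sxy = sum(x * y for x, u, y in data)
    suy = sum(u * y for x, u, y in data)
    det = sxx * suu - sxu * sxu
    return (suu * sxy - sxu * suy) / det, (sxx * suy - sxu * sxy) / det

def fit_model(data, members=5):
    """Bootstrap ensemble: mean prediction (mu_n) and member disagreement (sigma_n)."""
    params = [fit_linear(random.choices(data, k=len(data))) for _ in range(members)]
    def mu(x, u):
        return sum(a * x + b * u for a, b in params) / members
    def sigma(x, u):
        m = mu(x, u)
        return (sum((a * x + b * u - m) ** 2 for a, b in params) / members) ** 0.5
    return mu, sigma, params

data, k = [], GAINS[0]  # initial policy, used before any data exists
for n in range(5):
    xs, us, _ = rollout(f_star, k)                        # execute the policy on the real system
    for i in sorted(random.sample(range(STEPS), 20)):     # irregular measurement times
        y = f_star(xs[i], us[i]) + random.gauss(0.0, 0.05)  # noisy derivative measurement
        data.append((xs[i], us[i], y))
    mu, sigma, params = fit_model(data)                   # update the model
    k = max(GAINS, key=lambda g: rollout(mu, g)[2])       # plan the next policy on the mean model
```

The measurement indices are drawn at random along each trajectory to mimic irregular sampling; a fixed-rate sampler would be a special case.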
Optimistic exploration. A principled strategy is to act optimistically: choose a policy that performs well under the most favorable dynamics consistent with the data. This leads to the idealized objective
\[ \max_{\pi} \; \max_{\mathbf{f} \in \mathcal{M}_n} \; J(\pi, \mathbf{f}), \]
where \(\mathcal{M}_n\) is the set of plausible dynamics models. However, jointly optimizing over both policies and dynamics is computationally intractable in practice.
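As a toy illustration of the idealized objective, the sketch below enumerates a small finite policy class and a small finite plausible set. Both sets, the linear model family \(\dot{x} = a x + b u\), and the reward are assumptions chosen for illustration. With continuous policy and model classes such enumeration is not available, which is where the difficulty arises.

```python
import itertools

GOAL, T, STEPS = 1.0, 3.0, 300
DT = T / STEPS

def J(k, a, b, x0=0.0):
    """Euler approximation of the return of u = k * (GOAL - x) under x' = a*x + b*u."""
    x, ret = x0, 0.0
    for _ in range(STEPS):
        u = k * (GOAL - x)
        ret -= (x - GOAL) ** 2 * DT
        x += DT * (a * x + b * u)
    return ret

policies = [0.5, 1.0, 2.0]  # hypothetical policy class (feedback gains)
plausible = [(a, b) for a in (-0.7, -0.5, -0.3) for b in (0.8, 1.0, 1.2)]  # hypothetical M_n

# Joint maximisation: the cost grows with |policies| * |plausible|.
value, k_opt, f_opt = max(
    (J(k, a, b), k, (a, b)) for k, (a, b) in itertools.product(policies, plausible)
)

# For comparison: best return under a single nominal model from the plausible set.
nominal = max(J(k, -0.5, 1.0) for k in policies)
```

Because the nominal model belongs to the plausible set, the optimistic value is never below the nominal one.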
COMBRL. Instead, COMBRL plans with respect to the mean model \(\boldsymbol{\mu}_n\) and adds an exploration bonus based on the model's epistemic uncertainty \(\boldsymbol{\sigma}_n\), yielding a simple scalar objective that approximates optimistic exploration.
In episode \(n\), COMBRL plans under
\[ \dot{\mathbf{x}}'(t) = \boldsymbol{\mu}_n\big(\mathbf{x}'(t), \mathbf{u}(t)\big) \]
and computes a policy via
\[ \pi_n = \arg\max_{\pi \in \Pi} \int_0^T \Big[ r(\mathbf{x}'_t, \mathbf{u}_t) + \lambda_n \, \|\boldsymbol{\sigma}_n(\mathbf{x}'_t, \mathbf{u}_t)\| \Big] \, dt, \qquad \mathbf{u}_t = \pi(\mathbf{x}'_t). \]
Here, \(\lambda_n\) controls the exploration–exploitation trade-off, yielding:
- \(\lambda_n = 0\): reward-only (greedy),
- \(0 < \lambda_n < \infty\): task-driven exploration,
- \(\lambda_n \to \infty\): unsupervised exploration.
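The effect of \(\lambda_n\) can be seen in a small sketch that evaluates the reward-plus-bonus integral along mean-model rollouts for a few candidate policies. The mean model, the uncertainty function (small near previously visited states, growing outside), the reward, and the policy class are all hypothetical stand-ins, and the state is scalar so the norm reduces to an absolute value.

```python
GOAL, T, STEPS, GAIN = 1.0, 3.0, 300, 4.0
DT = T / STEPS
TARGETS = [0.0, 1.0, 3.0]  # hypothetical policy class: u = GAIN * (target - x)

def mu(x, u):
    """Hypothetical learned mean dynamics."""
    return -0.5 * x + u

def sigma(x, u):
    """Hypothetical epistemic uncertainty: small where data exists (|x| <= 1), larger outside."""
    return 0.05 + 0.5 * max(0.0, abs(x) - 1.0)

def reward(x, u):
    """Hypothetical task reward: stay close to GOAL."""
    return -(x - GOAL) ** 2

def combrl_objective(target, lam, x0=0.0):
    """Euler approximation of int_0^T [r + lam * |sigma|] dt along the mean-model rollout."""
    x, total = x0, 0.0
    for _ in range(STEPS):
        u = GAIN * (target - x)
        total += (reward(x, u) + lam * abs(sigma(x, u))) * DT
        x += DT * mu(x, u)
    return total

def plan(lam):
    """Pick the candidate policy that maximises the scalar objective."""
    return max(TARGETS, key=lambda tg: combrl_objective(tg, lam))

greedy = plan(0.0)         # reward-only planning
exploratory = plan(100.0)  # bonus-dominated planning
```

In this toy setting, reward-only planning selects the policy that stays near the goal, while a large weight on the bonus selects the policy that steers into the poorly modelled region.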
Main Results
On the theory side, we show sublinear regret in the reward-driven setting and provide a sample complexity guarantee in the unsupervised setting.
Empirically, COMBRL:
- scales better than prior continuous-time exploration methods,
- is more sample-efficient across several deep RL tasks,
- handles sparse-reward and underactuated tasks robustly,
- generalizes well to previously unseen downstream tasks, especially in the unsupervised setting,
- and enables efficient time-adaptive control with fewer interactions.
Scalability and robustness. COMBRL scales better than prior methods and performs strongly even in sparse-reward and underactuated settings.
Deep RL benchmarks. Across several deep RL tasks, COMBRL achieves strong sample efficiency and outperforms prior baselines.
Generalization. In the unsupervised setting, COMBRL learns exploratory behaviors that transfer well to previously unseen downstream tasks.
Time-adaptive control. COMBRL also enables efficient time-adaptive control and achieves strong performance with fewer interactions.
Why COMBRL?
COMBRL is designed to make optimistic exploration in continuous-time RL practical. It combines a simple planning objective with uncertainty-aware dynamics models such as Gaussian processes, Bayesian neural networks, or probabilistic ensembles. This makes it both theoretically grounded and scalable to challenging control problems.
Presentation
The paper was presented at the Pre-ICLR Poster Session at the ETH AI Center and will be presented at ICLR 2026 in Rio.
Poster session:
Pavilion 4, P4-#4706
Friday, April 24, 2026
10:30 AM – 1:00 PM
Links
Paper:
arXiv:2510.24482
Code:
github.com/lasgroup/ombrl
Project Website:
go.klem.nz/combrl