Kernel-Based Reinforcement Learning: A Finite-Time Analysis
Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko
Code (official): github.com/omardrwch/kernel_ucbvi_experiments
Abstract
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric. We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards and transitions to efficiently balance exploration and exploitation. For problems with $K$ episodes and horizon $H$, we provide a regret bound of $O\left(H^3 K^{\frac{2d}{2d+1}}\right)$, where $d$ is the covering dimension of the joint state-action space. This is the first regret bound for kernel-based RL using smoothing kernels, which requires very weak assumptions on the MDP and has been previously applied to a wide range of tasks. We empirically validate our approach in continuous MDPs with sparse rewards.
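To make the idea of a non-parametric kernel estimator of the rewards concrete, the following is a minimal sketch of a kernel-weighted reward average paired with a count-based optimism bonus. It is an illustration only, not the paper's exact estimator: the Gaussian kernel, the Euclidean metric, the regularizer `beta`, the `scale` constant, and all function names are assumptions introduced here.

```python
import numpy as np


def gaussian_kernel(dist, bandwidth):
    """Smoothing kernel (assumed Gaussian): weight decays with metric distance."""
    return np.exp(-0.5 * (dist / bandwidth) ** 2)


def kernel_reward_estimate(query, points, rewards, bandwidth=0.1, beta=0.01):
    """Kernel-weighted average of observed rewards around `query`.

    query:   state-action pair at which to estimate the reward, shape (dim,)
    points:  previously visited state-action pairs, shape (n, dim)
    rewards: rewards observed at those pairs, shape (n,)
    beta:    hypothetical regularizer keeping the denominator positive

    Returns (estimate, generalized_count), where the generalized count is
    beta plus the sum of kernel weights.
    """
    query = np.asarray(query, dtype=float).reshape(-1)
    points = np.asarray(points, dtype=float).reshape(-1, query.size)
    rewards = np.asarray(rewards, dtype=float).reshape(-1)

    # Euclidean distance is an assumption; any metric on the space could be used.
    dists = np.linalg.norm(points - query, axis=1)
    weights = gaussian_kernel(dists, bandwidth)

    count = beta + weights.sum()
    estimate = float(weights @ rewards) / count
    return estimate, count


def exploration_bonus(count, scale=1.0):
    """Illustrative optimism bonus that shrinks as nearby data accumulates."""
    return scale / np.sqrt(count)


if __name__ == "__main__":
    visited = [[0.0], [1.0]]
    observed = [1.0, 0.0]
    est, cnt = kernel_reward_estimate([0.0], visited, observed)
    print("estimate:", est, "optimistic value:", est + exploration_bonus(cnt))
```

In this sketch, a query far from all visited pairs receives a generalized count close to `beta`, so its bonus is large and an optimistic planner would be drawn toward it, which is the exploration mechanism the abstract alludes to.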