
Policy Zooming: Adaptive Discretization-based Infinite-Horizon Average-Reward Reinforcement Learning

2024-05-29

Avik Kar, Rahul Singh

Abstract

We study Lipschitz MDPs in the infinite-horizon average-reward reinforcement learning (RL) setup, in which an agent can play policies from a given set $\Phi$. The proposed algorithms "zoom" into "promising" regions of the policy space, thereby achieving adaptivity gains. We upper bound their regret as $\tilde{\mathcal{O}}\big(T^{1 - d_{\mathrm{eff.}}^{-1}}\big)$, where $d_{\mathrm{eff.}} = d^{\Phi}_{z} + 2$ for the model-free algorithm PZRL-MF and $d_{\mathrm{eff.}} = 2 d_{\mathcal{S}} + d^{\Phi}_{z} + 3$ for the model-based algorithm PZRL-MB. Here, $d_{\mathcal{S}}$ is the dimension of the state space and $d^{\Phi}_{z}$ is the zooming dimension, a problem-dependent quantity that depends not only on the underlying MDP but also on the class $\Phi$. This yields low regret when the agent competes against a low-complexity class $\Phi$ (one with a small $d^{\Phi}_{z}$). We note that preexisting notions of the zooming dimension are ill-suited to the non-episodic RL setting and do not yield adaptivity gains; the current work shows how to capture adaptivity gains for infinite-horizon average-reward RL in terms of $d^{\Phi}_{z}$. When specialized to the case of a finite-dimensional policy space, we obtain that $d_{\mathrm{eff.}}$ scales as the dimension of this space under mild technical conditions; we also obtain $d^{\Phi}_{z} = 0$, or equivalently $\tilde{\mathcal{O}}(\sqrt{T})$ regret for PZRL-MF, under a curvature condition on the average-reward function that is commonly used in the multi-armed bandit (MAB) literature. Simulation experiments validate the gains arising from adaptivity.
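
A quick worked instantiation of the stated bound, using only quantities defined in the abstract: when the zooming dimension vanishes, the PZRL-MF exponent reduces to the familiar square-root rate,

\[
d^{\Phi}_{z} = 0
\;\Longrightarrow\;
d_{\mathrm{eff.}} = d^{\Phi}_{z} + 2 = 2
\;\Longrightarrow\;
\tilde{\mathcal{O}}\big(T^{1 - d_{\mathrm{eff.}}^{-1}}\big)
= \tilde{\mathcal{O}}\big(T^{1/2}\big)
= \tilde{\mathcal{O}}(\sqrt{T}),
\]

which matches the regret the abstract claims for PZRL-MF under the curvature condition.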
