Sublinear Regret for Learning POMDPs

2021-07-08Unverified0· sign in to hype

Yi Xiong, Ningyuan Chen, Xuefeng Gao, Xiang Zhou

Unverified — Be the first to reproduce this paper.

Abstract

We study the model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method-of-moments estimations for hidden Markov models, the belief error control in POMDPs and upper-confidence-bound methods for online learning. We establish a regret bound of O(T^2/3 T) for the proposed learning algorithm where T is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.

Tasks

reinforcement-learning Reinforcement Learning (RL)

Sublinear Regret for Learning POMDPs

Abstract

Tasks

Reproductions