SOTAVerified

Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs

2020-10-01NeurIPS 2021Unverified0· sign in to hype

Jiafan He, Dongruo Zhou, Quanquan Gu

Unverified — Be the first to reproduce this paper.

Reproduce

Abstract

We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-, which is based on the optimism in the face of uncertainty principle and the Bernstein-type bonus. We show that UCBVI- achieves an O(SAT/(1-)^1.5) regret, where S is the number of states, A is the number of actions, is the discount factor and T is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least (SAT/(1-)^1.5). Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI- is nearly minimax optimal for discounted MDPs.

Tasks

Reproductions