
Q-learning with Logarithmic Regret

2020-06-16

Kunhe Yang, Lin F. Yang, Simon S. Du


Abstract

This paper presents the first non-asymptotic result showing that a model-free algorithm can achieve logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a strictly positive sub-optimality gap in the optimal Q-function. We prove that the optimistic Q-learning studied in [Jin et al. 2018] enjoys an $O\left(\frac{SA \cdot \mathrm{poly}(H)}{\Delta_{\min}} \log(SAT)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap. This bound matches the information-theoretic lower bound in terms of $S, A, T$ up to a $\log(SA)$ factor. We further extend our analysis to the discounted setting and obtain a similar logarithmic cumulative regret bound.
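To make the quantity in the denominator of the bound concrete, the following is a minimal sketch of how a minimum sub-optimality gap could be computed from a tabular optimal Q-function. It assumes the common definition of the gap as the smallest strictly positive value of $V^*_h(s) - Q^*_h(s,a)$ over all steps, states, and actions; the abstract does not spell this definition out. The function name `min_suboptimality_gap`, the nested-list layout `q_star[h][s][a]`, and the toy values are hypothetical and chosen only for illustration.

```python
def min_suboptimality_gap(q_star, tol=1e-12):
    """Smallest strictly positive gap V*_h(s) - Q*_h(s, a).

    q_star is a nested list indexed as q_star[h][s][a] (an assumed layout).
    Gaps at or below `tol` are treated as zero, i.e. the action is optimal.
    """
    gaps = []
    for step_values in q_star:            # loop over steps h
        for action_values in step_values:  # loop over states s
            v_star = max(action_values)    # V*_h(s) = max_a Q*_h(s, a)
            for q in action_values:        # loop over actions a
                gap = v_star - q
                if gap > tol:
                    gaps.append(gap)
    if not gaps:
        raise ValueError("every action is optimal, so no positive gap exists")
    return min(gaps)


# Toy optimal Q-function with H = 2 steps, S = 2 states, A = 2 actions.
Q_STAR = [
    [[1.0, 0.5], [2.0, 2.0]],
    [[0.0, 0.25], [1.0, 0.0]],
]

if __name__ == "__main__":
    print(min_suboptimality_gap(Q_STAR))
```

A smaller gap makes sub-optimal actions harder to tell apart from optimal ones, which is consistent with the gap appearing in the denominator of the regret bound.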
