Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis

2021-02-12

Gen Li, Changxiao Cai, Yuxin Chen, Yuting Wei, Yuejie Chi

Abstract

Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. When it comes to the synchronous setting (such that independent samples for all state-action pairs are drawn from a generative model in each iteration), substantial progress has been made towards understanding the sample efficiency of Q-learning. Consider a γ-discounted infinite-horizon MDP with state space S and action space A: to yield an entrywise ε-approximation of the optimal Q-function, state-of-the-art theory for Q-learning requires a sample size exceeding the order of |S||A|/((1-γ)^5 ε^2), which fails to match existing minimax lower bounds. This gives rise to natural questions: what is the sharp sample complexity of Q-learning? Is Q-learning provably sub-optimal? This paper addresses these questions for the synchronous setting: (1) when |A| = 1 (so that Q-learning reduces to TD learning), we prove that the sample complexity of TD learning is minimax optimal and scales as |S|/((1-γ)^3 ε^2) (up to log factor); (2) when |A| ≥ 2, we settle the sample complexity of Q-learning to be on the order of |S||A|/((1-γ)^4 ε^2) (up to log factor). Our theory unveils the strict sub-optimality of Q-learning when |A| ≥ 2, and rigorizes the negative impact of over-estimation in Q-learning. Finally, we extend our analysis to accommodate asynchronous Q-learning (i.e., the case with Markovian samples), sharpening the horizon dependency of its sample complexity to be 1/(1-γ)^4.
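To make the synchronous setting concrete, here is a minimal sketch of Q-learning with a generative model: at every iteration, one independent next-state sample is drawn for each state-action pair, and all entries of Q are updated simultaneously. This is only an illustration of the setting the abstract describes; the step-size schedule, the function name, and all variable names are assumptions, not the paper's prescription.

```python
import numpy as np

def synchronous_q_learning(P, r, gamma, num_iters, seed=0):
    """Illustrative synchronous Q-learning with a generative model.

    P: transition kernel, shape (S, A, S); P[s, a] is a distribution over next states.
    r: reward table, shape (S, A).
    gamma: discount factor in (0, 1).
    """
    rng = np.random.default_rng(seed)
    S, A = r.shape
    Q = np.zeros((S, A))
    for t in range(1, num_iters + 1):
        # Rescaled-linear step size (a common choice in the literature; assumption here).
        eta = 1.0 / (1 + (1 - gamma) * t)
        # Value estimate from the current iterate: V(s) = max_a Q(s, a).
        V = Q.max(axis=1)
        for s in range(S):
            for a in range(A):
                # One independent draw from the generative model for every (s, a).
                s_next = rng.choice(S, p=P[s, a])
                target = r[s, a] + gamma * V[s_next]
                Q[s, a] = (1 - eta) * Q[s, a] + eta * target
    return Q
```

When |A| = 1 the max over actions is vacuous and the iteration reduces to TD learning, matching case (1) of the abstract; the over-estimation effect the paper quantifies arises from the max operator when |A| ≥ 2.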
