
Finite Sample Analysis of Average-Reward TD Learning and Q-Learning

2021-12-01 · NeurIPS 2021

Sheng Zhang, Zhe Zhang, Siva Theja Maguluri

Abstract

The focus of this paper is on sample complexity guarantees of average-reward reinforcement learning algorithms, which are known to be more challenging to study than their discounted-reward counterparts. To the best of our knowledge, we provide the first known finite sample guarantees, under both constant and diminishing step sizes, for (i) average-reward TD(λ) with linear function approximation for policy evaluation and (ii) average-reward Q-learning in the tabular setting to find the optimal policy. A major challenge is that since the value functions are agnostic to an additive constant, the corresponding Bellman operators are no longer contraction mappings under any norm. We obtain the results for TD(λ) by working in an appropriately defined subspace that ensures uniqueness of the solution. For Q-learning, we exploit the span seminorm contractive property of the Bellman operator, and construct a novel Lyapunov function obtained by infimal convolution of a generalized Moreau envelope and the indicator function of a set.
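The central difficulty named in the abstract can be checked numerically: because the average-reward Bellman operator commutes with additive constant shifts, it cannot be a contraction in any norm, yet the span seminorm is blind to exactly those shifts. The sketch below illustrates this on a made-up 2-state, 2-action MDP (the transition probabilities, rewards, and Q-values are arbitrary illustrative numbers, not from the paper).

```python
import numpy as np

# Toy 2-state, 2-action MDP; P[s, a, s'] and r[s, a] are arbitrary
# illustrative values (rows of P sum to 1).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def bellman(q):
    """Average-reward Bellman optimality operator:
    (Tq)(s, a) = r(s, a) + sum_{s'} P(s' | s, a) * max_{a'} q(s', a')."""
    v = q.max(axis=1)   # greedy state value max_{a'} q(s', a')
    return r + P @ v    # batched matrix product sums over s'

def span(x):
    """Span seminorm sp(x) = max(x) - min(x); vanishes on constants."""
    return x.max() - x.min()

q = np.array([[0.5, -1.0], [2.0, 0.25]])
c = 4.0  # an additive constant shift

# T shifts by the same constant: T(q + c) = T(q) + c.  Hence for any norm,
# ||T(q + c) - T(q)|| = |c| * ||1|| > 0 while the inputs differ by a
# constant only -- T cannot be a contraction in that norm.
print(np.allclose(bellman(q + c), bellman(q) + c))  # True

# The span seminorm, in contrast, ignores additive constants entirely:
print(span((q + c) - q))  # 0.0
```

This shows why the paper works with the span seminorm for Q-learning: the quantity the operator actually contracts is the spread of the Q-values, which is invariant to the additive constant that makes ordinary norm contraction fail.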
