A Model-free Learning Algorithm for Infinite-horizon Average-reward MDPs with Near-optimal Regret

2020-06-08

Mehdi Jafarnia-Jahromi, Chen-Yu Wei, Rahul Jain, Haipeng Luo

Abstract

Recently, model-free reinforcement learning has attracted research attention due to its simplicity, memory and computation efficiency, and the flexibility to combine with function approximation. In this paper, we propose Exploration Enhanced Q-learning (EE-QL), a model-free algorithm for infinite-horizon average-reward Markov Decision Processes (MDPs) that achieves a regret bound of $O(\sqrt{T})$ for the general class of weakly communicating MDPs, where $T$ is the number of interactions. EE-QL assumes that an online concentrating approximation of the optimal average reward is available. This is the first model-free learning algorithm that achieves $O(\sqrt{T})$ regret without the ergodic assumption, and it matches the lower bound in terms of $T$ except for logarithmic factors. Experiments show that the proposed algorithm performs as well as the best known model-based algorithms.
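To make the regret quantity in the abstract concrete, here is a minimal sketch of how cumulative regret is commonly measured in the average-reward setting: the reward an optimal policy would collect over T steps (T times the optimal average reward) minus the reward the learner actually collected. The function name `average_reward_regret`, the parameter `optimal_gain`, and the toy reward sequence are illustrative assumptions, not taken from the paper, and this is not an implementation of EE-QL.

```python
from typing import Sequence


def average_reward_regret(rewards: Sequence[float], optimal_gain: float) -> float:
    """Cumulative regret after T = len(rewards) interactions.

    optimal_gain is the optimal long-run average reward per step
    (assumed known here purely for illustration).
    """
    horizon = len(rewards)
    # Reward of an optimal policy over the horizon, minus what was collected.
    return horizon * optimal_gain - sum(rewards)


# Toy example (hypothetical data): four interactions, optimal gain of 1.0.
rewards = [1.0, 0.0, 1.0, 1.0]
regret = average_reward_regret(rewards, optimal_gain=1.0)
```

A bound of order $\sqrt{T}$ on this quantity means the per-step gap to the optimal average reward shrinks roughly like $1/\sqrt{T}$ as the number of interactions grows.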

Tasks

Q-Learning, reinforcement-learning, Reinforcement Learning (RL)
