Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity
Zihan Zhang, Yuan Zhou, Xiangyang Ji
Abstract
In this paper we consider the problem of learning an $\epsilon$-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy with sample complexity $\widetilde{O}\!\left(\frac{SA\ln(1/p)}{\epsilon^{2}(1-\gamma)^{5.5}}\right)$ (where the notation $\widetilde{O}(\cdot)$ hides poly-logarithmic factors of $S$, $A$, $1/(1-\gamma)$, and $1/\epsilon$) and success probability $1-p$. For small enough $\epsilon$, we show an improved algorithm with sample complexity $\widetilde{O}\!\left(\frac{SA\ln(1/p)}{\epsilon^{2}(1-\gamma)^{3}}\right)$. While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on $S$, our second algorithm beats all known sample complexity bounds and matches the information-theoretic lower bound up to logarithmic factors.
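For reference, the two sample-complexity bounds can be set side by side with the lower bound they are measured against. The lower-bound expression below is the standard generative-model lower bound (Azar et al., 2013), which is our reading of the "information-theoretic lower bound" the abstract refers to; the abstract itself does not state it explicitly.

\[
  \underbrace{\widetilde{O}\!\left(\frac{SA\,\ln(1/p)}{\epsilon^{2}(1-\gamma)^{5.5}}\right)}_{\text{first algorithm (any }\epsilon\text{)}}
  \qquad
  \underbrace{\widetilde{O}\!\left(\frac{SA\,\ln(1/p)}{\epsilon^{2}(1-\gamma)^{3}}\right)}_{\text{second algorithm (small }\epsilon\text{)}}
  \qquad
  \underbrace{\Omega\!\left(\frac{SA}{\epsilon^{2}(1-\gamma)^{3}}\right)}_{\text{lower bound}}
\]

Under this reading, the second bound matches the lower bound up to the $\ln(1/p)$ and other poly-logarithmic factors hidden by $\widetilde{O}(\cdot)$, which is the sense in which it is minimax-optimal.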