Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity
Zihan Zhang, Yuan Zhou, Xiangyang Ji
Abstract
In this paper we consider the problem of learning an $\epsilon$-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy with sample complexity $\widetilde{O}\!\left(\frac{SA\ln(1/p)}{\epsilon^{2}(1-\gamma)^{5.5}}\right)$ (where the notation $\widetilde{O}(\cdot)$ hides poly-logarithmic factors of $S$, $A$, $1/(1-\gamma)$, and $1/\epsilon$) and success probability $1-p$. For small enough $\epsilon$, we show an improved algorithm with sample complexity $\widetilde{O}\!\left(\frac{SA\ln(1/p)}{\epsilon^{2}(1-\gamma)^{3}}\right)$. While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on $S$, our second algorithm beats all known sample complexity bounds and matches the information-theoretic lower bound up to logarithmic factors.
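For reference, the two sample-complexity bounds can be set side by side with the lower bound they are measured against. The lower-bound expression below is the standard generative-model lower bound (Azar et al., 2013), which is our reading of the "information-theoretic lower bound" the abstract refers to; the abstract itself does not state it explicitly.

\[
  \underbrace{\widetilde{O}\!\left(\frac{SA\,\ln(1/p)}{\epsilon^{2}(1-\gamma)^{5.5}}\right)}_{\text{first algorithm (any }\epsilon\text{)}}
  \qquad
  \underbrace{\widetilde{O}\!\left(\frac{SA\,\ln(1/p)}{\epsilon^{2}(1-\gamma)^{3}}\right)}_{\text{second algorithm (small }\epsilon\text{)}}
  \qquad
  \underbrace{\Omega\!\left(\frac{SA}{\epsilon^{2}(1-\gamma)^{3}}\right)}_{\text{lower bound}}
\]

Under this reading, the second bound matches the lower bound up to the $\ln(1/p)$ and other poly-logarithmic factors hidden by $\widetilde{O}(\cdot)$, which is the sense in which it is minimax-optimal.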