Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited
Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, Michal Valko
Unverified — Be the first to reproduce this paper.
ReproduceAbstract
In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the non-stationary case in which the transition kernel is allowed to change in each stage of the episode. Our main contribution is a novel lower bound of ((H^3SA/^2)(1/)) on the sample complexity of an (,)-PAC algorithm for best policy identification in a non-stationary MDP. This lower bound relies on a construction of "hard MDPs" which is different from the ones previously used in the literature. Using this same class of MDPs, we also provide a rigorous proof of the (H^3SAT) regret bound for non-stationary MDPs. Finally, we discuss connections to PAC-MDP lower bounds.