A Sample-Efficient Algorithm for Episodic Finite-Horizon MDP with Constraints
Krishna C. Kalagarla, Rahul Jain, Pierluigi Nuzzo
Unverified — Be the first to reproduce this paper.
ReproduceAbstract
Constrained Markov Decision Processes (CMDPs) formalize sequential decision-making problems whose objective is to minimize a cost function while satisfying constraints on various cost functions. In this paper, we consider the setting of episodic fixed-horizon CMDPs. We propose an online algorithm which leverages the linear programming formulation of finite-horizon CMDP for repeated optimistic planning to provide a probably approximately correct (PAC) guarantee on the number of episodes needed to ensure an -optimal policy, i.e., with resulting objective value within of the optimal value and satisfying the constraints within -tolerance, with probability at least 1-. The number of episodes needed is shown to be of the order O(|S||A|C^2H^2^21), where C is the upper bound on the number of possible successor states for a state-action pair. Therefore, if C |S|, the number of episodes needed have a linear dependence on the state and action space sizes |S| and |A|, respectively, and quadratic dependence on the time horizon H.