Near-optimal Reinforcement Learning in Factored MDPs

2014-03-15 · NeurIPS 2014

Ian Osband, Benjamin Van Roy

Abstract

Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer Ω(√(SAT)) regret on some MDP, where T is the elapsed time and S and A are the cardinalities of the state and action spaces. This implies T = Ω(SA) time to guarantee a near-optimal policy. In many settings of practical interest, due to the curse of dimensionality, S and A can be so enormous that this learning time is unacceptable. We establish that, if the system is known to be a factored MDP, it is possible to achieve regret that scales polynomially in the number of parameters encoding the factored MDP, which may be exponentially smaller than S or A. We provide two algorithms that satisfy near-optimal regret bounds in this context: posterior sampling reinforcement learning (PSRL) and an upper confidence bound algorithm (UCRL-Factored).
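
To make the abstract's contrast between S, A and the number of parameters of a factored MDP concrete, the following is a minimal counting sketch. It is not taken from the paper: the function names, the choice of binary state factors, the fixed number of parent factors, and the unfactored action set are all illustrative assumptions.

```python
# Illustrative parameter counting only; all names and the specific
# structural assumptions below are hypothetical, not from the paper.

def flat_param_count(num_states: int, num_actions: int) -> int:
    """Free transition parameters of a tabular MDP.

    Each (state, action) pair has a distribution over num_states
    next states, which has num_states - 1 free parameters.
    """
    return num_states * num_actions * (num_states - 1)


def factored_param_count(num_factors: int, factor_size: int,
                         num_parents: int, num_actions: int) -> int:
    """Free transition parameters of a factored MDP, assuming every
    state factor takes factor_size values and its next value depends
    only on num_parents state factors plus the (unfactored) action.
    """
    parent_configs = factor_size ** num_parents * num_actions
    return num_factors * parent_configs * (factor_size - 1)


if __name__ == "__main__":
    m, d, k, A = 20, 2, 3, 4   # 20 binary factors, 3 parents each, 4 actions
    S = d ** m                 # size of the flat state space
    print("flat state space size S:", S)
    print("flat transition parameters:", flat_param_count(S, A))
    print("factored transition parameters:", factored_param_count(m, d, k, A))
```

Under these assumptions the factored count grows linearly in the number of factors, while S and the flat count grow exponentially in it, which is the gap the abstract refers to.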

Tasks

Reinforcement Learning (RL)
