
Posterior sampling for reinforcement learning: worst-case regret bounds

2017-05-19

Shipra Agrawal, Randy Jia


Abstract

We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of Õ(D√(SAT)) for any communicating MDP with S states, A actions, and diameter D. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average reward policy, in time horizon T. This result closely matches the known lower bound of Ω(√(DSAT)). Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
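The core idea of posterior sampling for RL can be illustrated in a few lines: maintain Dirichlet posteriors over the unknown transition probabilities, periodically draw a sampled MDP from the posterior, act greedily with respect to it, and update the counts from observed transitions. The sketch below is only a simplified illustration under assumed conditions (known rewards, a small synthetic MDP, per-step resampling, and a discounted value-iteration proxy for the undiscounted average-reward planning the paper analyzes); it is not the authors' exact algorithm, which samples in epochs and uses multiple posterior samples.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 3, 2  # small illustrative MDP (assumed sizes)

# Dirichlet counts: one pseudo-count per (s, a, s') transition (uniform prior).
counts = np.ones((S, A, S))
rewards = rng.uniform(size=(S, A))  # rewards assumed known, for simplicity

def sample_mdp(counts):
    """Draw one transition kernel from the Dirichlet posterior."""
    P = np.empty_like(counts)
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(counts[s, a])
    return P

def greedy_policy(P, rewards, gamma=0.95, iters=500):
    """Value iteration on the sampled MDP (discounted proxy for average reward)."""
    V = np.zeros(S)
    for _ in range(iters):
        Q = rewards + gamma * (P @ V)  # (S, A, S) @ (S,) -> (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# Interaction loop: sample an MDP, act greedily, update posterior counts.
P_true = sample_mdp(np.ones((S, A, S)))  # stand-in for the unknown environment
s = 0
for t in range(100):
    pi = greedy_policy(sample_mdp(counts), rewards)
    a = pi[s]
    s_next = rng.choice(S, p=P_true[s, a])
    counts[s, a, s_next] += 1
    s = s_next
```

Because the sampled kernel fluctuates with the posterior, the agent explores state-action pairs whose transitions are still uncertain; as counts grow, the posterior concentrates and the sampled MDP (and hence the policy) stabilizes.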
