Optimistic posterior sampling for reinforcement learning: worst-case regret bounds

2017-12-01 · NeurIPS 2017

Shipra Agrawal, Randy Jia

Abstract

We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of Õ(D√(SAT)) for any communicating MDP with S states, A actions and diameter D, when T ≥ S^5 A. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average reward policy, in time horizon T. This result improves over the best previously known upper bound of Õ(DS√(AT)) achieved by any algorithm in this setting, and matches the dependence on S in the established lower bound of Ω(√(DSAT)) for this problem. Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
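To make the abstract's setup concrete, here is a minimal sketch of posterior sampling for a tabular MDP with unknown transitions: maintain Dirichlet posteriors over each row of the transition kernel, sample a model, act greedily in the sampled model, and update the counts. This is an illustrative simplification, not the paper's algorithm — the paper's method uses epochs, multiple posterior samples, and an extended optimistic MDP in the undiscounted average-reward setting, whereas this sketch resamples every step and uses discounted value iteration as a stand-in planner. The small MDP, rewards, and parameters below are all hypothetical.

```python
import numpy as np

def sample_mdp(counts, rng):
    # Draw one transition model from the Dirichlet posterior;
    # a Dirichlet(1,...,1) prior is encoded as initial pseudo-counts of 1.
    S, A, _ = counts.shape
    P = np.zeros_like(counts, dtype=float)
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(counts[s, a])
    return P

def greedy_policy(P, R, gamma=0.95, iters=200):
    # Value iteration on the sampled model. The paper analyzes the
    # undiscounted average-reward criterion; discounting here is only
    # a convenient stand-in for the planning step.
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * P @ V   # Q has shape (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def posterior_sampling(T, S=3, A=2, seed=0):
    rng = np.random.default_rng(seed)
    true_P = rng.dirichlet(np.ones(S), size=(S, A))  # hypothetical true MDP
    R = rng.random((S, A))                           # known rewards, in [0, 1]
    counts = np.ones((S, A, S))                      # Dirichlet prior pseudo-counts
    s, total_reward = 0, 0.0
    for _ in range(T):
        P_sampled = sample_mdp(counts, rng)  # one posterior sample
        pi = greedy_policy(P_sampled, R)
        a = pi[s]
        s_next = rng.choice(S, p=true_P[s, a])
        total_reward += R[s, a]
        counts[s, a, s_next] += 1            # Bayesian update of transition counts
        s = s_next
    return total_reward

total = posterior_sampling(T=30)
```

Since per-step rewards lie in [0, 1], the accumulated reward over T steps is bounded by T; regret, as defined in the abstract, would compare this sum against T times the optimal average reward.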
