The Extended UCB Policies for Frequentist Multi-armed Bandit Problems

2011-12-08Unverified0· sign in to hype

Keqin Liu, Tianshuo Zheng, Haoran Chen

Unverified — Be the first to reproduce this paper.

Abstract

The multi-armed bandit (MAB) problem is a widely studied model in the field of operations research for sequential decision making and reinforcement learning. This paper mainly considers the classical MAB model with the heavy-tailed reward distributions. We introduce the extended robust UCB policy, which is an extension of the pioneering UCB policies proposed by Bubeck et al. [5] and Lattimore [21]. The previous UCB policies require the knowledge of an upper bound on specific moments of reward distributions or a particular moment to exist, which can be hard to acquire or guarantee in practical scenarios. Our extended robust UCB generalizes Lattimore's seminary work (for moments of orders p=4 and q=2) to arbitrarily chosen p and q as long as the two moments have a known controlled relationship, while still achieving the optimal regret growth order O(log T), thus providing a broadened application area of the UCB policies for the heavy-tailed reward distributions.

Tasks

Decision Making Sequential Decision Making

The Extended UCB Policies for Frequentist Multi-armed Bandit Problems

Abstract

Tasks

Reproductions