Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards
Hao Qin, Kwang-Sung Jun, Chicheng Zhang
Code: github.com/MjolnirT/Kullback-Leibler-Maillard-Sampling (official)
Abstract
We study K-armed bandit problems where the reward distributions of the arms are all supported on the [0,1] interval. Designing regret-efficient randomized exploration algorithms in this setting has been a long-standing challenge. Maillard sampling (Maillard, 2013), an attractive alternative to Thompson sampling, has recently been shown to achieve competitive regret guarantees in the sub-Gaussian reward setting (Bian and Jun, 2022) while maintaining closed-form action probabilities, which is useful for offline policy evaluation. In this work, we propose the Kullback-Leibler Maillard Sampling (KL-MS) algorithm, a natural extension of Maillard sampling that achieves a KL-style gap-dependent regret bound. We show that KL-MS is asymptotically optimal when the rewards are Bernoulli and has a worst-case regret bound of the form $O(\sqrt{\mu^*(1-\mu^*) K T \log K} + K \log T)$, where $\mu^*$ is the expected reward of the optimal arm and $T$ is the time horizon length.
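To illustrate the closed-form action probabilities the abstract highlights, here is a minimal sketch of a Maillard-style sampling rule with a binary KL divergence, of the form $p_t(a) \propto \exp(-N_a \,\mathrm{kl}(\hat\mu_a, \hat\mu_{\max}))$. The function names and the clipping constant are illustrative assumptions, not the paper's exact specification:

```python
import math

def binary_kl(p, q, eps=1e-12):
    """Bernoulli KL divergence kl(p, q), clipped away from {0, 1} for stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ms_probs(means, counts):
    """Closed-form Maillard-style action probabilities (illustrative sketch):
    p(a) proportional to exp(-N_a * kl(mu_hat_a, mu_hat_max)),
    where means[a] is arm a's empirical mean and counts[a] its pull count."""
    mu_max = max(means)
    weights = [math.exp(-n * binary_kl(m, mu_max))
               for m, n in zip(means, counts)]
    total = sum(weights)
    return [w / total for w in weights]
```

Because the probabilities are available in closed form (rather than implicitly, as in Thompson sampling), they can be logged alongside each action, which is what makes downstream offline policy evaluation straightforward.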