A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal, and Parameter-free
Yifang Chen, Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei
Unverified — Be the first to reproduce this paper.
ReproduceAbstract
We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret. Specifically, our algorithm achieves dynamic regret O( , ^13T^23\) for a contextual bandit problem with T rounds, S switches and total variation in data distributions. Importantly, our algorithm is adaptive and does not need to know S or ahead of time, and can be implemented efficiently assuming access to an ERM oracle. Our results strictly improve the O( ^14T^34, ^15T^45\) bound of (Luo et al., 2018), and greatly generalize and improve the O(ST) result of (Auer et al, 2018) that holds only for the two-armed bandit problem without contextual information. The key novelty of our algorithm is to introduce replay phases, in which the algorithm acts according to its previous decisions for a certain amount of time in order to detect non-stationarity while maintaining a good balance between exploration and exploitation.