
Smooth Non-Stationary Bandits

2023-01-29

Su Jia, Qian Xie, Nathan Kallus, Peter I. Frazier


Abstract

In many applications of online decision making, the environment is non-stationary, and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time. However, in practice, environments often change smoothly, so such algorithms may incur higher-than-necessary regret. We study a non-stationary bandits problem where each arm's mean reward sequence can be embedded into a β-Hölder function, i.e., a function that is (β−1)-times Lipschitz-continuously differentiable. The non-stationarity becomes smoother as β increases. When β = 1, this corresponds to the non-smooth regime, where Besbes et al. (2014) established a minimax regret of Θ(T^{2/3}). We show the first separation between the smooth (i.e., β ≥ 2) and non-smooth (i.e., β = 1) regimes by presenting a policy with Õ(k^{4/5} T^{3/5}) regret on any k-armed, 2-Hölder instance. We complement this result by showing that the minimax regret on the β-Hölder family of instances is Ω(T^{(β+1)/(2β+1)}) for any integer β ≥ 1. This matches our upper bound for β = 2 up to logarithmic factors. Furthermore, we validate the effectiveness of our policy through a comprehensive numerical study using real-world click-through rate data.
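The regret exponents quoted in the abstract follow a single formula, (β+1)/(2β+1), which recovers the classical non-smooth rate at β = 1 and the improved smooth rate at β = 2. A minimal sketch (not from the paper) that tabulates these exponents for a few smoothness levels:

```python
# Illustrative only: exponent of T in the minimax regret for the
# β-Hölder family, as stated in the abstract: T^{(β+1)/(2β+1)}.
from fractions import Fraction

def regret_exponent(beta: int) -> Fraction:
    """Return (β+1)/(2β+1), the exponent of T for a given integer β ≥ 1."""
    if beta < 1:
        raise ValueError("β must be a positive integer")
    return Fraction(beta + 1, 2 * beta + 1)

for beta in (1, 2, 3):
    # β=1 gives 2/3 (the non-smooth rate of Besbes et al.); β=2 gives 3/5.
    print(f"β={beta}: T^{regret_exponent(beta)}")
```

Note that the exponent decreases toward 1/2 as β grows, so smoother environments admit regret approaching the stationary √T rate.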
