How Does Variance Shape the Regret in Contextual Bandits?
Zeyu Jia, Jian Qian, Alexander Rakhlin, Chen-Yu Wei
Abstract
We consider realizable contextual bandits with general function approximation, investigating how small reward variance can lead to better-than-minimax regret bounds. Unlike in minimax bounds, we show that the eluder dimension d_elu (a complexity measure of the function class) plays a crucial role in variance-dependent bounds. We consider two types of adversaries: (1) Weak adversary: The adversary sets the reward variance before observing the learner's action. In this setting, we prove that a regret of Ω(√(min{A, d_elu} Λ) + d_elu) is unavoidable when d_elu ≤ √(AT), where A is the number of actions, T is the total number of rounds, and Λ is the total variance over T rounds. For the A ≤ d_elu regime, we derive a nearly matching upper bound Õ(√(AΛ) + d_elu) for the special case where the variance is revealed at the beginning of each round. (2) Strong adversary: The adversary sets the reward variance after observing the learner's action. We show that a regret of Ω(√(d_elu Λ) + d_elu) is unavoidable when √(d_elu Λ) + d_elu ≤ √(AT). In this setting, we provide an upper bound of order Õ(√(d_elu Λ) + d_elu). Furthermore, we examine the setting where the function class additionally provides distributional information of the reward, as studied by Wang et al. (2024). We demonstrate that the regret bound Õ(√(d_elu Λ) + d_elu) established in their work is unimprovable when √(d_elu Λ) + d_elu ≤ √(AT). However, with a slightly different definition of the total variance and with the assumption that the reward follows a Gaussian distribution, one can achieve a regret of Õ(√(AΛ) + d_elu).
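For quick reference, the bounds stated above can be collected in one display. This is only a sketch of the abstract's statements; the reading Λ = Σ_{t=1}^T σ_t² (summing per-round reward variances σ_t²) is our assumption about the notation, and the last line uses the paper's slightly different definition of the total variance.

\documentclass{article}
\usepackage{amsmath,amssymb}
% Shorthand for the eluder dimension used throughout the abstract.
\newcommand{\delu}{d_{\mathrm{elu}}}
\begin{document}
% Regret bounds from the abstract; A = number of actions, T = number of
% rounds, \Lambda = total variance over T rounds. The Wang et al. (2024)
% bound coincides with the strong-adversary upper bound and is shown to
% be unimprovable under the same condition as the strong-adversary
% lower bound.
\begin{align*}
&\text{Weak adversary, lower bound (when } \delu \le \sqrt{AT}\text{):}
  && \Omega\!\left(\sqrt{\min\{A,\delu\}\,\Lambda} + \delu\right) \\
&\text{Weak adversary, upper bound (} A \le \delu\text{, variance revealed each round):}
  && \widetilde{O}\!\left(\sqrt{A\Lambda} + \delu\right) \\
&\text{Strong adversary, lower bound (when } \sqrt{\delu\Lambda} + \delu \le \sqrt{AT}\text{):}
  && \Omega\!\left(\sqrt{\delu\Lambda} + \delu\right) \\
&\text{Strong adversary, upper bound:}
  && \widetilde{O}\!\left(\sqrt{\delu\Lambda} + \delu\right) \\
&\text{Distributional function class, Gaussian rewards, upper bound:}
  && \widetilde{O}\!\left(\sqrt{A\Lambda} + \delu\right)
\end{align*}
\end{document}

The weak-adversary pair is nearly matching precisely because min{A, d_elu} = A in the A ≤ d_elu regime that the upper bound covers.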