Tight Gap-Dependent Memory-Regret Trade-Off for Single-Pass Streaming Stochastic Multi-Armed Bandits
Zichun Ye, Chihao Zhang, Jiahao Zhao
Abstract
We study the problem of minimizing gap-dependent regret for single-pass streaming stochastic multi-armed bandits (MAB). In this problem, the $n$ arms are presented in a stream, and at most $m < n$ arms and their statistics can be stored in the memory. We establish tight non-asymptotic regret bounds with respect to all relevant parameters, including the number of arms $n$, the memory size $m$, the number of rounds $T$, and $(\Delta_i)_{i \in [n]}$, where $\Delta_i$ is the gap between the reward mean of the best arm and that of the $i$-th arm. These gaps are not known in advance by the player. Specifically, for any constant $\alpha \ge 1$, we present two algorithms: one applicable for $m \ge \frac{2}{3}n$ with regret at most $O_{\alpha}\left(\frac{(n-m)T^{\frac{1}{\alpha+1}}}{n^{1+\frac{1}{\alpha+1}}} \sum_{i:\Delta_i > 0} \Delta_i^{1-2\alpha}\right)$, and another applicable for $m < \frac{2}{3}n$ with regret at most $O_{\alpha}\left(\frac{T^{\frac{1}{\alpha+1}}}{m^{\frac{1}{\alpha+1}}} \sum_{i:\Delta_i > 0} \Delta_i^{1-2\alpha}\right)$. We also prove matching lower bounds for both cases by showing that for any constant $\alpha \ge 1$ and any $m \le k < n$, there exists a set of hard instances on which the regret of any algorithm is $\Omega_{\alpha}\left(\frac{(k-m+1)T^{\frac{1}{\alpha+1}}}{k^{1+\frac{1}{\alpha+1}}} \sum_{i:\Delta_i > 0} \Delta_i^{1-2\alpha}\right)$. This is the first tight gap-dependent regret bound for streaming MAB. Prior to our work, an $O\left(\sum_{\Delta_i > 0} \frac{\sqrt{T\log T}}{\Delta_i}\right)$ upper bound for the special case of $\alpha = 1$ and $m = O(1)$ was established by Agarwal, Khanna and Patil (COLT'22). In contrast, our results show that the correct order of the regret is $\Theta\left(\frac{1}{\sqrt{m}} \sum_{\Delta_i > 0} \frac{\sqrt{T}}{\Delta_i}\right)$.
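As an illustrative sanity check (not part of the original abstract), specializing the small-memory upper bound to the case $\alpha = 1$ recovers the $\Theta(\cdot)$ order stated at the end of the abstract, since the exponents become $\frac{1}{\alpha+1} = \frac{1}{2}$ and $1 - 2\alpha = -1$:

```latex
% Setting \alpha = 1 in the m < (2/3)n upper bound:
\[
  O_{\alpha}\!\left(\frac{T^{\frac{1}{\alpha+1}}}{m^{\frac{1}{\alpha+1}}}
    \sum_{i:\Delta_i > 0} \Delta_i^{1-2\alpha}\right)
  \;\xrightarrow{\;\alpha = 1\;}\;
  O\!\left(\sqrt{\frac{T}{m}} \sum_{i:\Delta_i > 0} \frac{1}{\Delta_i}\right)
  \;=\;
  O\!\left(\frac{1}{\sqrt{m}} \sum_{i:\Delta_i > 0} \frac{\sqrt{T}}{\Delta_i}\right).
\]
```

This matches the claimed tight order $\Theta\left(\frac{1}{\sqrt{m}} \sum_{\Delta_i > 0} \frac{\sqrt{T}}{\Delta_i}\right)$ and, for $m = O(1)$, improves the prior bound of Agarwal, Khanna and Patil by a $\sqrt{\log T}$ factor.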