Tight Regret Bounds for Single-pass Streaming Multi-armed Bandits
Chen Wang
Code: github.com/jhwjhw0123/streaming-regret-minimization-mabs
Abstract
Regret minimization in streaming multi-armed bandits (MABs) has been studied extensively in recent years. In the single-pass setting with K arms and T trials, a regret lower bound of Ω(T^{2/3}) has been proved for any algorithm with o(K) memory (Maiti et al. [NeurIPS'21]; Agarwal et al. [COLT'22]). On the other hand, the previous best regret upper bound is still O(K^{1/3} T^{2/3} log^{1/3}(T)), which is achieved by the streaming implementation of simple uniform exploration. The O(K^{1/3} log^{1/3}(T)) gap leaves open the question of the tight regret bound for single-pass MABs with sublinear arm memory. In this paper, we answer this open question and complete the picture of regret minimization in single-pass streaming MABs. We first improve the regret lower bound to Ω(K^{1/3} T^{2/3}) for algorithms with o(K) memory, which matches the uniform-exploration regret up to a logarithmic factor in T. We then show that the log^{1/3}(T) factor is not necessary: we can achieve O(K^{1/3} T^{2/3}) regret by finding an ε-best arm and committing to it for the rest of the trials. For regret minimization with high constant probability, we can apply the single-memory ε-best arm algorithms of Jin et al. [ICML'21] to obtain the optimal bound. Furthermore, for expected regret minimization, we design an algorithm with a single-arm memory that achieves O(K^{1/3} T^{2/3} log(K)) regret, and an algorithm with O(log^*(n)) memory that attains the optimal O(K^{1/3} T^{2/3}) regret following the ε-best arm algorithm of Assadi and Wang [STOC'20]. We further test the empirical performance of our algorithms. The simulation results show that the proposed algorithms consistently outperform the benchmark uniform exploration algorithm by a large margin, and on occasion reduce the regret by up to 70%.
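As a concrete illustration of the ε-best-arm-then-commit idea with a single-arm memory (a minimal sketch, not the paper's exact procedure): stream over the arms once, keep one stored champion arm, and commit to the champion for the remaining trials. The Bernoulli reward model and the fixed `pulls_per_arm` budget below are assumptions for the example; roughly, choosing the exploration budget on the order of (T/K)^{2/3} pulls per arm is what yields the K^{1/3} T^{2/3}-type regret scaling.

```python
import random


def eps_best_then_commit(arm_means, T, pulls_per_arm):
    """Single-pass, single-arm-memory sketch (hypothetical interface).

    Arms arrive one at a time as Bernoulli means in `arm_means`.
    We sample each arriving arm `pulls_per_arm` times, keep only the
    empirically best arm seen so far (one arm in memory), and then
    commit to it for the rest of the T trials.
    """
    reward = 0.0
    pulls = 0
    champ_mean_true = None   # true mean of the stored champion arm
    champ_mean_emp = -1.0    # its empirical mean from exploration

    for mu in arm_means:     # single pass over the arm stream
        successes = 0
        for _ in range(pulls_per_arm):
            r = 1.0 if random.random() < mu else 0.0
            successes += r
            reward += r
            pulls += 1
        emp = successes / pulls_per_arm
        if emp > champ_mean_emp:          # keep the better of the two arms
            champ_mean_true, champ_mean_emp = mu, emp

    while pulls < T:         # commit phase: play the stored arm
        reward += 1.0 if random.random() < champ_mean_true else 0.0
        pulls += 1
    return reward, champ_mean_true
```

With well-separated arms the exploration budget above identifies the best arm with overwhelming probability, and the commit phase then incurs essentially no additional regret.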