
Continuous K-Max Bandits

2025-02-19

Yu Chen, Siwei Wang, Longbo Huang, Wei Chen


Abstract

We study the K-Max combinatorial multi-armed bandits problem with continuous outcome distributions and weak value-index feedback: each base arm has an unknown continuous outcome distribution, and in each round the learning agent selects K arms, obtains the maximum of the values sampled from these K arms as reward, and observes this reward together with the index of the arm that achieved it as feedback. This setting captures critical applications in recommendation systems, distributed computing, server scheduling, etc. Continuous K-Max bandits introduce unique challenges, including discretization error from continuous-to-discrete conversion, non-deterministic tie-breaking under limited feedback, and biased estimation due to partial observability. Our key contribution is the computationally efficient algorithm DCK-UCB, which combines adaptive discretization with bias-corrected confidence bounds to tackle these challenges. For general continuous distributions, we prove that DCK-UCB achieves an Õ(T^{3/4}) regret upper bound, establishing the first sublinear regret guarantee for this setting. Furthermore, we identify an important special case with exponential distributions under full-bandit feedback. In this case, our proposed algorithm MLE-Exp enables an Õ(√T) regret upper bound through maximum log-likelihood estimation, achieving near-minimax optimality.
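The weak value-index feedback protocol described in the abstract can be sketched as a simulator: each selected arm draws from its unknown continuous distribution, and the agent observes only the maximum value and which arm produced it. The function and variable names below are illustrative, not from the paper.

```python
import random

def kmax_round(arm_samplers, selected):
    """One round of the continuous K-Max bandit with weak value-index
    feedback: sample each selected arm, return only the maximum value
    and the index of the arm that achieved it (all other samples are
    hidden from the learner)."""
    samples = [(arm_samplers[i](), i) for i in selected]
    value, idx = max(samples)  # continuous outcomes, so ties occur w.p. 0
    return value, idx

# Toy example: four exponential arms (the paper's special case), K = 2.
random.seed(0)
arms = [lambda lam=l: random.expovariate(lam) for l in (1.0, 0.5, 2.0, 1.5)]
reward, winner = kmax_round(arms, selected=[0, 2])
```

Only `(reward, winner)` reaches the learner each round, which is what makes per-arm estimation biased: an arm's sample is observed only when it happens to be the maximum.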
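For the exponential special case, the quantity MLE-Exp would plausibly maximize involves the density of the maximum of K independent exponentials, which has a standard closed form: f(x) = Σ_j λ_j e^{-λ_j x} Π_{i≠j} (1 − e^{-λ_i x}). A minimal sketch of that per-observation log-likelihood follows; the function name and interface are hypothetical, and the paper's actual estimator details are not given in the abstract.

```python
import math

def loglik_max_exp(x, lambdas):
    """Log-density at x of the maximum of independent Exp(lambda_i)
    variables: f(x) = sum_j lam_j * exp(-lam_j x) * prod_{i!=j} (1 - exp(-lam_i x))."""
    total = 0.0
    for j, lj in enumerate(lambdas):
        term = lj * math.exp(-lj * x)
        for i, li in enumerate(lambdas):
            if i != j:
                term *= 1.0 - math.exp(-li * x)
        total += term
    return math.log(total)
```

Summing this over the observed full-bandit rewards gives a log-likelihood in the rate parameters, which a maximum-likelihood routine could then optimize over candidate rate vectors.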
