No-Regret Linear Bandits under Gap-Adjusted Misspecification
Chong Liu, Dan Qiao, Ming Yin, Ilija Bogunovic, Yu-Xiang Wang
Unverified — Be the first to reproduce this paper.
ReproduceAbstract
This work studies linear bandits under a new notion of gap-adjusted misspecification and is an extension of Liu et al. (2023). When the underlying reward function is not linear, existing linear bandits work usually relies on a uniform misspecification parameter that measures the sup-norm error of the best linear approximation. This results in an unavoidable linear regret whenever > 0. We propose a more natural model of misspecification which only requires the approximation error at each input x to be proportional to the suboptimality gap at x. It captures the intuition that, for optimization problems, near-optimal regions should matter more and we can tolerate larger approximation errors in suboptimal regions. Quite surprisingly, we show that the classical LinUCB algorithm -- designed for the realizable case -- is automatically robust against such -gap-adjusted misspecification with parameter diminishing at O(1/(d T)). It achieves a near-optimal O(T) regret for problems that the best-known regret is almost linear in time horizon T. We further advance this frontier by presenting a novel phased elimination-based algorithm whose gap-adjusted misspecification parameter = O(1/d) does not scale with T. This algorithm attains optimal O(T) regret and is deployment-efficient, requiring only T batches of exploration. It also enjoys an adaptive O( T) regret when a constant suboptimality gap exists. Technically, our proof relies on a novel self-bounding argument that bounds the part of the regret due to misspecification by the regret itself, and a new inductive lemma that limits the misspecification error within the suboptimality gap for all valid actions in each batch selected by G-optimal design.