An Improved Relaxation for Oracle-Efficient Adversarial Contextual Bandits
Unverified — Be the first to reproduce this paper.
ReproduceAbstract
We present an oracle-efficient relaxation for the adversarial contextual bandits problem, where the contexts are sequentially drawn i.i.d from a known distribution and the cost sequence is chosen by an online adversary. Our algorithm has a regret bound of O(T^23(K(||))^13) and makes at most O(K) calls per round to an offline optimization oracle, where K denotes the number of actions, T denotes the number of rounds and denotes the set of policies. This is the first result to improve the prior best bound of O((TK)^23((||))^13) as obtained by Syrgkanis et al. at NeurIPS 2016, and the first to match the original bound of Langford and Zhang at NeurIPS 2007 which was obtained for the stochastic case.