Conservative Contextual Bandits: Beyond Linear Representations
Rohan Deb, Mohammad Ghavamzadeh, Arindam Banerjee
Abstract
Conservative Contextual Bandits (CCBs) address safety in sequential decision making by requiring that an agent's policy, along with minimizing regret, also satisfies a safety constraint: its performance is not worse than that of a baseline policy (e.g., the policy the company has in production) by more than a (1+α) factor. Prior work developed UCB-style algorithms in the multi-armed [Wu et al., 2016] and contextual linear [Kazerouni et al., 2017] settings. However, in practice the cost of an arm is often a non-linear function of the context, and existing UCB algorithms are therefore ineffective in such settings. In this paper, we consider CCBs beyond the linear case and develop two algorithms, C-SquareCB and C-FastCB, that use Inverse Gap Weighting (IGW) based exploration and an online regression oracle. We show that the safety constraint is satisfied with high probability and that the regret of C-SquareCB is sub-linear in the horizon T, while the regret of C-FastCB is first-order, i.e., sub-linear in L^*, the cumulative loss of the optimal policy. Subsequently, we use a neural network for function approximation and online gradient descent as the regression oracle to provide Õ(√(KT) + K/α) and Õ(√(KL^*) + K(1 + 1/α)) regret bounds, respectively. Finally, we demonstrate the efficacy of our algorithms on real-world data and show that they significantly outperform the existing baseline while maintaining the performance guarantee.
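To make the two ingredients in the abstract concrete, the Python sketch below illustrates Inverse Gap Weighting over predicted losses and a conservative fallback to the baseline arm; it is not the paper's pseudocode. The function names (igw_distribution, conservative_igw_step), the assumption that losses lie in [0, 1], and the crude worst-case budget check are illustrative choices: in the paper, predicted losses come from the online regression oracle (e.g., a neural network trained with online gradient descent), and the safety check is based on high-probability bounds rather than the simple pessimism used here.

import numpy as np

def igw_distribution(pred_losses, gamma):
    # Inverse Gap Weighting: each non-greedy arm gets mass 1 / (K + gamma * gap),
    # where gap is its predicted excess loss over the greedy arm; the greedy arm
    # receives the remaining probability mass.
    K = len(pred_losses)
    best = int(np.argmin(pred_losses))
    probs = np.zeros(K)
    for a in range(K):
        if a != best:
            probs[a] = 1.0 / (K + gamma * (pred_losses[a] - pred_losses[best]))
    probs[best] = 1.0 - probs.sum()
    return probs

def conservative_igw_step(pred_losses, gamma, baseline_arm, cum_loss,
                          cum_baseline_loss, alpha, rng):
    # One round of a conservative IGW policy (illustrative simplification).
    # The arm sampled from the IGW distribution is played only if a pessimistic
    # estimate of the cumulative loss stays within (1 + alpha) times the
    # baseline's cumulative loss; otherwise the baseline arm is played.
    probs = igw_distribution(pred_losses, gamma)
    proposed = int(rng.choice(len(pred_losses), p=probs))
    pessimistic_cum_loss = cum_loss + 1.0  # worst case: losses assumed in [0, 1]
    budget = (1.0 + alpha) * (cum_baseline_loss + pred_losses[baseline_arm])
    return proposed if pessimistic_cum_loss <= budget else baseline_arm

The exploration parameter gamma trades off exploitation against exploration (larger gamma concentrates mass on the greedy arm), and alpha is the conservativeness level from the (1+α) constraint; both names are taken from the standard IGW and conservative-bandit formulations rather than from this paper's notation.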