Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization

2021-02-15

Aadirupa Saha, Nagarajan Natarajan, Praneeth Netrapalli, Prateek Jain


Abstract

We study online learning with bandit feedback (i.e., the learner has access to only a zeroth-order oracle) where the cost/reward functions $f_t$ admit a "pseudo-1d" structure, i.e. $f_t(w) = \mathrm{loss}_t(\mathrm{pred}_t(w))$, where the output of $\mathrm{pred}_t$ is one-dimensional. At each round, the learner observes context $x_t$, plays prediction $\mathrm{pred}_t(w_t; x_t)$ (e.g. $\mathrm{pred}_t(w) = \langle x_t, w \rangle$) for some $w_t \in \mathbb{R}^d$, and observes loss $\mathrm{loss}_t(\mathrm{pred}_t(w_t))$, where $\mathrm{loss}_t$ is a convex Lipschitz-continuous function. The goal is to minimize the standard regret metric. This pseudo-1d bandit convex optimization problem (SBCO) arises frequently in domains such as online decision-making and parameter tuning in large systems. For this problem, we first show a lower bound of $\Omega(\min(\sqrt{dT},\, T^{3/4}))$ on the regret of any algorithm, where $T$ is the number of rounds. We propose a new algorithm that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively, guaranteeing the optimal regret bound above up to additional logarithmic factors. In contrast, applying state-of-the-art online convex optimization methods leads to $\tilde{O}(\min(d^{9.5}\sqrt{T},\, \sqrt{d}\,T^{3/4}))$ regret, which is significantly suboptimal in $d$.
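
To make the interaction protocol concrete, here is a minimal Python sketch of the pseudo-1d bandit setting with linear predictions $\mathrm{pred}_t(w) = \langle x_t, w \rangle$, using a standard one-point (Flaxman-style) gradient estimator as a generic baseline learner. This is not the paper's algorithm, which additionally combines randomized online gradient descent with kernelized exponential weights; the dimensions, step size, exploration radius, and quadratic loss below are all illustrative assumptions.

```python
import numpy as np

# Sketch of the pseudo-1d bandit convex optimization (SBCO) protocol with a
# generic one-point gradient estimator (Flaxman et al.) as the learner.
# NOT the paper's algorithm; parameters and the loss are illustrative.

rng = np.random.default_rng(0)
d, T = 5, 2000
delta, eta = 0.1, 0.05          # exploration radius and step size (untuned)
w_star = rng.normal(size=d)     # hidden comparator defining each round's loss
w = np.zeros(d)                 # learner's iterate
total_loss = 0.0

for t in range(T):
    x_t = rng.normal(size=d)                        # observed context
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                          # uniform direction on the sphere
    w_play = w + delta * u                          # randomized (perturbed) play

    # Pseudo-1d structure: f_t(w) = loss_t(pred_t(w)) with pred_t(w) = <x_t, w>.
    pred = x_t @ w_play
    loss = (pred - x_t @ w_star) ** 2               # convex; Lipschitz on a bounded domain
    total_loss += loss

    # Bandit (zeroth-order) feedback: only the scalar `loss` is revealed.
    g_hat = (d / delta) * loss * u                  # one-point gradient estimate
    w = w - eta * g_hat
    w = w / max(1.0, np.linalg.norm(w))             # project back onto the unit ball

print(f"average loss over {T} rounds: {total_loss / T:.4f}")
```

Note that the learner never sees $\mathrm{loss}_t$ or its gradient, only the realized scalar loss at the perturbed point; this is exactly the zeroth-order constraint that forces randomized exploration.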
