Cramming Contextual Bandits for On-policy Statistical Evaluation

2024-03-11

Zeyang Jia, Kosuke Imai, Michael Lingzhi Li

Abstract

We introduce the cram method as a general statistical framework for evaluating the final learned policy from a multi-armed contextual bandit algorithm, using the dataset generated by the same bandit algorithm. The proposed on-policy evaluation methodology differs from most existing methods that focus on off-policy performance evaluation of contextual bandit algorithms. Cramming utilizes an entire bandit sequence through a single pass of data, leading to both statistically and computationally efficient evaluation. We prove that if a bandit algorithm satisfies a certain stability condition, the resulting crammed evaluation estimator is consistent and asymptotically normal under mild regularity conditions. Furthermore, we show that this stability condition holds for commonly used linear contextual bandit algorithms, including epsilon-greedy, Thompson Sampling, and Upper Confidence Bound algorithms. Using both synthetic and publicly available datasets, we compare the empirical performance of cramming with the state-of-the-art methods. The results demonstrate that the proposed cram method reduces the evaluation standard error by approximately 40% relative to off-policy evaluation methods while preserving unbiasedness and valid confidence interval coverage.

Tasks

Multi-Armed Bandits, Off-policy evaluation, Thompson Sampling, valid
