
Convergence of a L2 regularized Policy Gradient Algorithm for the Multi Armed Bandit

2024-02-09 · Code Available

Stefana Anita, Gabriel Turinici


Abstract

Although the Multi Armed Bandit (MAB) framework on one hand and the policy gradient approach on the other are among the most widely used tools of Reinforcement Learning, the theoretical properties of the policy gradient algorithm applied to MAB have not received enough attention. In this work we investigate the convergence of such a procedure when an L2 regularization term is present jointly with the 'softmax' parametrization. We prove convergence under appropriate technical hypotheses and test the procedure numerically, including in situations beyond the theoretical setting. The tests show that a time-dependent regularized procedure can improve over the canonical approach, especially when the initial guess is far from the solution.
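As a rough illustration of the setting the abstract describes, the sketch below runs a softmax-parametrized policy gradient (REINFORCE-style) on a Gaussian-reward bandit with a fixed L2 penalty on the parameters. The function and parameter names (`run_bandit`, `lam`, `lr`) and all constants are assumptions for illustration, not the authors' code or hypotheses.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(theta):
    z = theta - theta.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def run_bandit(means, lam=0.01, lr=0.1, steps=5000):
    """Stochastic gradient ascent on E[reward] - lam * ||theta||^2
    for a K-armed bandit with softmax arm-selection probabilities."""
    K = len(means)
    theta = np.zeros(K)
    for _ in range(steps):
        p = softmax(theta)
        a = rng.choice(K, p=p)               # sample an arm
        r = means[a] + rng.normal(0.0, 0.1)  # noisy reward
        one_hot = np.zeros(K)
        one_hot[a] = 1.0
        # grad log pi(a; theta) = e_a - p under the softmax parametrization
        g = r * (one_hot - p) - 2.0 * lam * theta  # regularized gradient
        theta += lr * g
    return theta, softmax(theta)
```

For instance, with arm means `[0.2, 0.9, 0.5]` the final policy should concentrate on the second arm; letting `lam` decrease over the iterations would correspond to the time-dependent regularized procedure the abstract mentions.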
