A Diffusion Analysis of Policy Gradient for Stochastic Bandits

2026-03-10

Tor Lattimore


Abstract

We study a continuous-time diffusion approximation of policy gradient for k-armed stochastic bandits. We prove that with a learning rate η = O(Δ^2 / log(n)) the regret is O(k log(k) log(n) / η), where n is the horizon and Δ is the minimum gap. Moreover, we construct an instance with only logarithmically many arms for which the regret is linear unless η = O(Δ^2).
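For orientation, the following is a minimal sketch of a discrete-time softmax policy gradient update on a k-armed bandit, the kind of algorithm a diffusion approximation of this sort would stand in for. It is not taken from the paper: the softmax parameterisation, the unit-variance Gaussian rewards, the absence of a baseline, and the names `softmax` and `policy_gradient_bandit` are all assumptions made for illustration.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax over the logits theta."""
    z = theta - theta.max()
    p = np.exp(z)
    return p / p.sum()


def policy_gradient_bandit(means, n, eta, seed=0):
    """Run softmax policy gradient for n rounds (illustrative sketch).

    Assumptions, not from the paper: rewards are Gaussian with unit
    variance, logits start at zero, and no baseline is subtracted.
    Returns the cumulative expected regret and the final policy.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    k = means.size
    theta = np.zeros(k)
    best = means.max()
    regret = 0.0
    for _ in range(n):
        pi = softmax(theta)
        a = rng.choice(k, p=pi)
        reward = means[a] + rng.standard_normal()
        # Stochastic gradient of the expected reward with respect to the
        # logits: reward * (indicator of the played arm - pi).
        grad = -pi * reward
        grad[a] += reward
        theta += eta * grad
        regret += best - means[a]
    return regret, pi if n == 0 else softmax(theta)


if __name__ == "__main__":
    n = 10_000
    gap = 0.5  # minimum gap Δ of the example instance below
    eta = gap**2 / np.log(n)  # learning rate on the order of Δ^2 / log(n)
    regret, policy = policy_gradient_bandit([0.5, 0.0, 0.0], n, eta)
    print("regret:", regret)
    print("final policy:", policy)
```

The learning rate in the usage example is set to Δ^2 / log(n) with constant 1, which matches only the order of magnitude stated in the abstract; the constant that the analysis actually requires is not given here.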
