Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation
Pihe Hu, Yu Chen, Longbo Huang
Abstract
We study reinforcement learning with linear function approximation, where the transition probability and reward functions are linear with respect to a feature mapping $\phi(s,a)$. Specifically, we consider the episodic inhomogeneous linear Markov Decision Process (MDP) and propose a novel computationally efficient algorithm, LSVI-UCB$^+$, which achieves an $\tilde{O}(Hd\sqrt{T})$ regret bound, where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps. LSVI-UCB$^+$ builds on weighted ridge regression and upper-confidence value iteration with a Bernstein-type exploration bonus. Our statistical results are obtained with novel analytical tools, including a new Bernstein self-normalized bound with conservatism on elliptical potentials and a refined analysis of the correction term. The algorithm is minimax optimal for linear MDPs up to logarithmic factors, closing the $\sqrt{Hd}$ gap between the upper bound of $\tilde{O}(\sqrt{H^3d^3T})$ in (Jin et al., 2020) and the lower bound of $\Omega(Hd\sqrt{T})$ for linear MDPs.
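To make the two ingredients named in the abstract concrete, here is a minimal Python sketch of a variance-weighted ridge regression update together with an exploration bonus of the form $\beta\,\|\phi\|_{\Lambda^{-1}}$. The function names, the regularizer `lam`, the confidence radius `beta`, and the toy data are illustrative assumptions, not the authors' reference implementation; in LSVI-UCB$^+$ the per-sample weights are estimated variances of the value targets and $\beta$ is set by a Bernstein-type concentration argument.

```python
import numpy as np

# Minimal sketch (assumed notation, not the authors' reference code) of
# variance-weighted ridge regression and a bonus beta * ||phi||_{Lambda^{-1}}.

def weighted_ridge(phis, targets, sigma2s, lam=1.0):
    """Solve min_w  sum_k (phi_k^T w - y_k)^2 / sigma2_k  +  lam * ||w||^2.

    phis:    (K, d) array of features phi(s_k, a_k)
    targets: (K,)   regression targets (reward plus next-step value estimate)
    sigma2s: (K,)   estimated variances used as per-sample weights
    """
    d = phis.shape[1]
    Lambda = lam * np.eye(d)              # weighted, regularized Gram matrix
    b = np.zeros(d)
    for phi, y, s2 in zip(phis, targets, sigma2s):
        Lambda += np.outer(phi, phi) / s2
        b += phi * y / s2
    w_hat = np.linalg.solve(Lambda, b)    # weighted ridge estimate
    return w_hat, Lambda

def exploration_bonus(phi, Lambda, beta):
    """Bonus beta * sqrt(phi^T Lambda^{-1} phi) added to the value estimate."""
    return beta * np.sqrt(phi @ np.linalg.solve(Lambda, phi))

# Toy usage: fit on synthetic data, then score one feature vector.
rng = np.random.default_rng(0)
phis = rng.normal(size=(100, 4))
targets = phis @ np.array([1.0, -0.5, 0.2, 0.0]) + rng.normal(scale=0.1, size=100)
sigma2s = np.full(100, 0.01)              # placeholder variance estimates
w_hat, Lambda = weighted_ridge(phis, targets, sigma2s)
print(w_hat, exploration_bonus(phis[0], Lambda, beta=1.0))
```

Downweighting low-variance samples less (i.e., dividing by the estimated variance) is what sharpens the confidence set enough to replace the Hoeffding-type bonus of LSVI-UCB with a Bernstein-type one; the rest of the value-iteration loop is unchanged in spirit.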