Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation
Woojin Chae, Dabeen Lee
Unverified — Be the first to reproduce this paper.
ReproduceAbstract
This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear Markov decision processes (MDPs) and linear mixture MDPs under the Bellman optimality condition. While guaranteeing computational efficiency, our algorithm for linear MDPs achieves the best-known regret upper bound of O(d^3/2sp(v^*)T) over T time steps where sp(v^*) is the span of the optimal bias function v^* and d is the dimension of the feature mapping. For linear mixture MDPs, our algorithm attains a regret bound of O(dsp(v^*)T). The algorithm applies novel techniques to control the covering number of the value function class and the span of optimistic estimators of the value function, which is of independent interest.