Statistical Efficiency of Distributional Temporal Difference Learning and Freedman's Inequality in Hilbert Spaces

2024-03-09

Yang Peng, Liangyu Zhang, Zhihua Zhang

Abstract

Distributional reinforcement learning (DRL) has achieved empirical success in various domains. One core task in DRL is distributional policy evaluation, which involves estimating the return distribution $\eta^\pi$ for a given policy $\pi$. Distributional temporal difference learning has been accordingly proposed, which extends the classic temporal difference learning (TD) in RL. In this paper, we focus on the non-asymptotic statistical rates of distributional TD. To facilitate theoretical analysis, we propose non-parametric distributional TD (NTD). For a $\gamma$-discounted infinite-horizon tabular Markov decision process, we show that NTD with a generative model needs $\widetilde{O}\left(\varepsilon^{-2}\mu_{\min}^{-1}(1-\gamma)^{-3}\right)$ interactions with the environment to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $1$-Wasserstein distance. This sample complexity bound is minimax optimal up to logarithmic factors. In addition, we revisit categorical distributional TD (CTD), showing that the same non-asymptotic convergence bounds hold for CTD in the case of the $1$-Wasserstein distance. We also extend our analysis to the more general setting where the data-generating process is Markovian. In the Markovian setting, we propose variance-reduced variants of NTD and CTD, and show that both can achieve an $\widetilde{O}\left(\varepsilon^{-2}\mu_{\pi,\min}^{-1}(1-\gamma)^{-3} + t_{\mathrm{mix}}\,\mu_{\pi,\min}^{-1}(1-\gamma)^{-1}\right)$ sample complexity bound in the case of the $1$-Wasserstein distance, which matches the state-of-the-art statistical results for classic policy evaluation. To achieve these sharp statistical rates, we establish a novel Freedman's inequality in Hilbert spaces. This new Freedman's inequality would be of independent interest for the statistical analysis of various infinite-dimensional online learning problems.
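
To make the object of study concrete, the sketch below implements one step of tabular categorical distributional TD (CTD), the classic instance revisited in the abstract: each state's return distribution is a probability vector over a fixed grid of atoms, and the bootstrapped target $r + \gamma Z'$ is projected back onto that grid before a step-size mixing update. This is a minimal illustration of the standard categorical projection, not the paper's NTD algorithm or its variance-reduced variants; the function names, the toy two-state chain, and the step-size schedule are all illustrative assumptions.

```python
import numpy as np

def categorical_projection(probs, atoms, r, gamma):
    """Project the law of r + gamma * Z (Z ~ probs over `atoms`)
    back onto the fixed, evenly spaced support `atoms`."""
    K = len(atoms)
    z_min, dz = atoms[0], atoms[1] - atoms[0]
    g = np.clip(r + gamma * atoms, z_min, atoms[-1])  # shifted/scaled atoms
    b = (g - z_min) / dz                              # fractional grid indices
    lo, hi = np.floor(b).astype(int), np.ceil(b).astype(int)
    same = lo == hi                                   # atom lands exactly on the grid
    out = np.zeros(K)
    # split each atom's mass between its two neighbouring grid points
    np.add.at(out, lo, probs * np.where(same, 1.0, hi - b))
    np.add.at(out, hi, probs * np.where(same, 0.0, b - lo))
    return out

def ctd_step(eta, s, r, s_next, atoms, gamma, alpha):
    """One CTD update at state s: mix the current estimate with the
    projected bootstrapped target, using step size alpha."""
    target = categorical_projection(eta[s_next], atoms, r, gamma)
    eta[s] = (1.0 - alpha) * eta[s] + alpha * target

# Toy usage on a hypothetical two-state chain with generative-model access.
rng = np.random.default_rng(0)
gamma, K = 0.9, 51
atoms = np.linspace(0.0, 1.0 / (1.0 - gamma), K)  # returns lie in [0, (1-gamma)^-1] for rewards in [0, 1]
P = np.array([[0.7, 0.3], [0.4, 0.6]])            # transition kernel under the fixed policy
R = np.array([0.2, 1.0])                          # deterministic rewards
eta = np.full((2, K), 1.0 / K)                    # uniform initial return distributions
for t in range(1, 20001):
    alpha = 1.0 / t**0.6                          # illustrative polynomial step size
    for s in range(2):
        s_next = rng.choice(2, p=P[s])            # sample s' ~ P(s, .)
        ctd_step(eta, s, R[s], s_next, atoms, gamma, alpha)
```

In the paper's framing the error of an estimate like `eta` is measured in the $1$-Wasserstein distance; for two distributions supported on the same evenly spaced grid this reduces to the atom spacing times the $\ell_1$ distance between their CDFs, which makes the metric cheap to evaluate in such a sketch.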
