
WD3: Taming the Estimation Bias in Deep Reinforcement Learning

2020-06-18

Qiang He, Xinwen Hou

Abstract

The overestimation phenomenon caused by function approximation is a well-known issue in value-based reinforcement learning algorithms such as deep Q-networks and DDPG, and it can lead to suboptimal policies. To address this issue, TD3 takes the minimum value between a pair of critics. In this paper, we show that the TD3 algorithm introduces underestimation bias under mild assumptions. To obtain a more precise value-function estimate, we unify these two opposites and propose a novel algorithm, Weighted Delayed Deep Deterministic Policy Gradient (WD3), which eliminates the estimation bias and further improves performance by weighting a pair of critics. To demonstrate the effectiveness of WD3, we compare the learning process of the value function among DDPG, TD3, and WD3. The results verify that our algorithm does eliminate the estimation error of the value function. Furthermore, we evaluate our algorithm on continuous control tasks. In each test task, WD3 consistently outperforms, or at the very least matches, the performance of state-of-the-art algorithms. Our code is available at https://sites.google.com/view/ictai20-wd3/.
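The core idea described in the abstract — interpolating between TD3's pessimistic minimum over two critics and their (more optimistic) average — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the weight name `beta` and the default values are assumptions for the example.

```python
import numpy as np

def weighted_target(q1, q2, reward, gamma=0.99, beta=0.75):
    """Sketch of a WD3-style weighted critic target.

    q1, q2 : arrays of target-critic values Q_1(s', a'), Q_2(s', a')
    beta   : assumed interpolation weight; beta = 1 recovers the
             TD3 target (pure min), beta = 0 uses the plain average.
    """
    q_min = np.minimum(q1, q2)       # TD3-style pessimistic estimate
    q_avg = (q1 + q2) / 2.0          # average estimate (optimistic side)
    q_weighted = beta * q_min + (1.0 - beta) * q_avg
    return reward + gamma * q_weighted

# Example: with q1=1.0, q2=2.0, reward=0.5, the weighted value is
# 0.75*1.0 + 0.25*1.5 = 1.125, so the target is 0.5 + 0.99*1.125.
y = weighted_target(np.array([1.0]), np.array([2.0]), reward=0.5)
```

Weighting the min against the average lets the weight trade off TD3's underestimation (argued in the paper) against the overestimation of a single or averaged critic.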
