Policy Optimization with Stochastic Mirror Descent
Long Yang, Yu Zhang, Gang Zheng, Qian Zheng, Pengfei Li, Jianhang Huang, Jun Wen, Gang Pan
Abstract
Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes VRMPO, a sample-efficient policy gradient method based on stochastic mirror descent. In VRMPO, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that VRMPO needs only O(ε^{-3}) sample trajectories to achieve an ε-approximate first-order stationary point, which matches the best known sample complexity for policy optimization. Extensive experimental results demonstrate that VRMPO outperforms state-of-the-art policy gradient methods in various settings.
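For context, the generic stochastic mirror descent step for maximizing the expected return $J(\theta)$ has the form below; the particular mirror map and variance-reduced estimator used by VRMPO are not given in this abstract, so this is the standard textbook form rather than the paper's exact update:

$$
\theta_{t+1} \;=\; \operatorname*{arg\,min}_{\theta} \Big\{ -\langle \hat{g}_t,\, \theta \rangle \;+\; \tfrac{1}{\alpha}\, D_{\psi}(\theta, \theta_t) \Big\},
$$

where $\hat{g}_t \approx \nabla J(\theta_t)$ is the (here, variance-reduced) policy gradient estimate, $\alpha > 0$ is the step size, and $D_{\psi}$ is the Bregman divergence induced by a mirror map $\psi$. Choosing $\psi(\theta) = \tfrac{1}{2}\|\theta\|_2^2$ recovers ordinary stochastic gradient ascent, so mirror descent generalizes the usual policy gradient update to non-Euclidean geometries.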