
Average Reward Reinforcement Learning with Monotonic Policy Improvement

2021-01-01

Yiming Zhang, Keith W. Ross


Abstract

In continuing control tasks, an agent's average reward per time step is a more natural performance measure than the commonly used discounting framework, as it better captures an agent's long-term behavior. We derive a novel lower bound on the difference between the average rewards of two policies, where the lower bound depends on the average divergence between the policies. We show that previous work based on the discounted return (Schulman et al., 2015; Achiam et al., 2017) results in a trivial lower bound in the average reward setting. We develop an iterative procedure based on our lower bound which produces a sequence of monotonically improved policies for the average reward criterion. When combined with deep reinforcement learning methods, the procedure leads to scalable and efficient algorithms aimed at maximizing an agent's average reward performance. Empirically, we demonstrate the efficacy of our algorithms through a series of high-dimensional control tasks with long time horizons, and show that discounting can lead to unsatisfactory performance on continuing control tasks.
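For context, the average-reward criterion the abstract refers to can be stated with the standard definition (notation here is ours, not necessarily the paper's): for a policy $\pi$,

```latex
\rho(\pi) \;=\; \lim_{N \to \infty} \frac{1}{N}\,
\mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{N-1} r(s_t, a_t)\right],
```

in contrast to the discounted return $\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t,a_t)\right]$, whose effective horizon of roughly $1/(1-\gamma)$ can down-weight long-term behavior in continuing tasks. The monotonic-improvement procedure described above then has the familiar TRPO-style shape (a sketch only; the paper's exact surrogate, divergence, and penalty constant differ and are derived from its average-reward lower bound):

```latex
\pi_{k+1} \;=\; \arg\max_{\pi}\;
L_{\pi_k}(\pi) \;-\; C \cdot
\mathbb{E}_{s \sim d_{\pi_k}}\!\big[D\big(\pi(\cdot \mid s)\,\|\,\pi_k(\cdot \mid s)\big)\big],
```

where $L_{\pi_k}$ is a local surrogate for the average-reward improvement and $d_{\pi_k}$ is the stationary state distribution under $\pi_k$. Because the penalty matches a valid lower bound, each update is guaranteed not to decrease $\rho(\pi)$.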
