
Revisit Policy Optimization in Matrix Form

2019-09-19

Sitao Luan, Xiao-Wen Chang, Doina Precup


Abstract

In the tabular case, when the reward and environment dynamics are known, policy evaluation can be written as $V_\pi = (I - \gamma P_\pi)^{-1} r_\pi$, where $P_\pi$ is the state transition matrix under policy $\pi$ and $r_\pi$ is the reward vector under $\pi$. The difficulty is that $P_\pi$ and $r_\pi$ are both entangled with $\pi$: every time we update $\pi$, they change together. In this paper, we leverage the notation from Wang et al. (2007) to disentangle $\pi$ from the environment dynamics, which makes optimization over the policy more straightforward. We show that the policy gradient theorem (Sutton and Barto, 2018) and TRPO (Schulman et al., 2015) can be placed into a more general framework, and that this notation has good potential to be extended to model-based reinforcement learning.
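The closed-form evaluation above can be sketched numerically. This is a minimal illustration, not the paper's code: the 2-state, 2-action MDP (tensors `P`, `r`, and policy `pi`) is invented for the example, and the discount `gamma` is an assumed value.

```python
import numpy as np

gamma = 0.9  # assumed discount factor for the toy example
n_states = 2

# Toy MDP (illustrative values): P[s, a, s'] are transition
# probabilities, r[s, a] expected rewards, pi[s, a] action probabilities.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.6, 0.4],
               [0.5, 0.5]])

# Policy-induced quantities:
#   P_pi[s, s'] = sum_a pi[s, a] * P[s, a, s']
#   r_pi[s]     = sum_a pi[s, a] * r[s, a]
P_pi = np.einsum('sa,sat->st', pi, P)
r_pi = np.einsum('sa,sa->s', pi, r)

# Matrix-form policy evaluation: V_pi = (I - gamma * P_pi)^{-1} r_pi,
# computed with a linear solve rather than an explicit inverse.
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Sanity check against the Bellman equation V = r_pi + gamma * P_pi @ V.
assert np.allclose(V_pi, r_pi + gamma * P_pi @ V_pi)
```

Note that `P_pi` and `r_pi` must be rebuilt from `P`, `r`, and `pi` after every policy update, which is exactly the entanglement the paper's notation aims to factor out.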
