
SPP-RL: State Planning Policy Reinforcement Learning

2021-09-29

Jacek Cyranka, Zuzanna Opała, Jacek Płocharczyk, Mikhail Zanka


Abstract

We introduce an algorithm for reinforcement learning in which the actor plans the next state given the current state. To communicate the actor's output to the environment, we incorporate an inverse dynamics control model and train it using supervised learning. We train the RL agent using state-of-the-art off-policy reinforcement learning algorithms: DDPG, TD3, and SAC. To guarantee that the target states are physically relevant, the overall learning procedure is formulated as a constrained optimization problem, solved via the classical Lagrangian method. We benchmark the state planning RL approach on a varied set of continuous environments, including standard MuJoCo tasks, safety-gym level 0 environments, and AntPush. In the SPP approach, the optimal policy is searched for in the space of state-state mappings, a considerably larger space than the traditional space of state-action mappings. We report that, quite surprisingly, SPP implementations attain performance superior to the vanilla state-of-the-art off-policy RL algorithms in the tested environments.
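The core loop described in the abstract can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: all names, shapes, and the linear-tanh function approximators are assumptions. The key point is that the actor outputs a target *state*, and a separately trained inverse dynamics model converts the (current state, target state) pair into the action actually sent to the environment.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 4, 2  # illustrative dimensions

# Hypothetical actor: maps the current state to a planned next state
# (a state-state mapping, per the SPP formulation).
W_actor = rng.normal(size=(state_dim, state_dim))
def actor(state):
    return np.tanh(state @ W_actor)

# Hypothetical inverse dynamics model: maps (state, target state) to an
# action; in SPP this model is trained separately with supervised learning.
W_inv = rng.normal(size=(2 * state_dim, action_dim))
def inverse_dynamics(state, target_state):
    return np.tanh(np.concatenate([state, target_state]) @ W_inv)

# One interaction step: plan a state, then recover the action to execute.
s = rng.normal(size=state_dim)
s_target = actor(s)                # policy output lives in state space
a = inverse_dynamics(s, s_target)  # action communicated to the environment
print(s_target.shape, a.shape)     # (4,) (2,)
```

In a full implementation, `actor` would be the policy network trained by DDPG, TD3, or SAC, and a Lagrangian penalty would constrain `s_target` to physically reachable states; both are omitted here for brevity.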
