Learning and Planning in Complex Action Spaces

2021-04-13

Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Mohammadamin Barekatain, Simon Schmitt, David Silver

Abstract

Many important real-world problems have action spaces that are high-dimensional, continuous or both, making full enumeration of all possible actions infeasible. Instead, only small subsets of actions can be sampled for the purpose of policy evaluation and improvement. In this paper, we propose a general framework to reason in a principled way about policy evaluation and improvement over such sampled action subsets. This sample-based policy iteration framework can in principle be applied to any reinforcement learning algorithm based upon policy iteration. Concretely, we propose Sampled MuZero, an extension of the MuZero algorithm that is able to learn in domains with arbitrarily complex action spaces by planning over sampled actions. We demonstrate this approach on the classical board game of Go and on two continuous control benchmark domains: DeepMind Control Suite and Real-World RL Suite.
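To make the idea of working with a sampled action subset concrete, here is a minimal sketch under stated assumptions. It draws K candidate actions from a one-dimensional Gaussian policy, scores each with a toy action-value function, and forms an improved policy over the sampled subset only. The Gaussian policy, `toy_q_value`, the softmax improvement step, and the temperature are all hypothetical choices made for illustration; this is not the Sampled MuZero algorithm, which plans over sampled actions with tree search.

```python
import math
import random


def sample_actions(mean, std, k, rng):
    """Draw k candidate actions from a 1-D Gaussian policy (hypothetical policy)."""
    return [rng.gauss(mean, std) for _ in range(k)]


def toy_q_value(action):
    """Hypothetical action-value function, peaked at action = 0.5."""
    return -(action - 0.5) ** 2


def improved_policy_over_samples(actions, q_fn, temperature=0.1):
    """Softmax over the Q-values of the sampled actions only (illustrative stand-in)."""
    q_values = [q_fn(a) for a in actions]
    q_max = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)  # fixed seed so the run is repeatable
actions = sample_actions(mean=0.0, std=1.0, k=8, rng=rng)
probs = improved_policy_over_samples(actions, toy_q_value)

# The sampled action with the highest improved probability.
best_action = actions[probs.index(max(probs))]
```

The full continuous action space is never enumerated here: both evaluation (`toy_q_value`) and improvement (the softmax) touch only the K sampled actions, which is the setting the abstract describes.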

Tasks

Continuous Control, Game of Go

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
acrobot.swingup	SMuZero	Return	417.52	—	Unverified
ball_in_cup.catch	SMuZero	Return	977.38	—	Unverified
cartpole.balance	SMuZero	Return	984.86	—	Unverified
cartpole.balance_sparse	SMuZero	Return	998.14	—	Unverified
cartpole.swingup	SMuZero	Return	868.87	—	Unverified
cartpole.swingup_sparse	SMuZero	Return	846.91	—	Unverified
cheetah.run	SMuZero	Return	914.39	—	Unverified
finger.spin	SMuZero	Return	986.38	—	Unverified
finger.turn_easy	SMuZero	Return	972.53	—	Unverified
finger.turn_hard	SMuZero	Return	963.07	—	Unverified
hopper.hop	SMuZero	Return	528.24	—	Unverified
hopper.stand	SMuZero	Return	926.5	—	Unverified
pendulum.swingup	SMuZero	Return	837.76	—	Unverified
quadruped.run	SMuZero	Return	923.54	—	Unverified
quadruped.walk	SMuZero	Return	933.77	—	Unverified
reacher.easy	SMuZero	Return	982.26	—	Unverified
reacher.hard	SMuZero	Return	971.53	—	Unverified
walker.run	SMuZero	Return	931.06	—	Unverified
walker.stand	SMuZero	Return	987.79	—	Unverified
walker.walk	SMuZero	Return	975.46	—	Unverified
