Learning and Planning in Complex Action Spaces

2021-04-13 · Code Available

Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Mohammadamin Barekatain, Simon Schmitt, David Silver

Abstract

Many important real-world problems have action spaces that are high-dimensional, continuous or both, making full enumeration of all possible actions infeasible. Instead, only small subsets of actions can be sampled for the purpose of policy evaluation and improvement. In this paper, we propose a general framework to reason in a principled way about policy evaluation and improvement over such sampled action subsets. This sample-based policy iteration framework can in principle be applied to any reinforcement learning algorithm based upon policy iteration. Concretely, we propose Sampled MuZero, an extension of the MuZero algorithm that is able to learn in domains with arbitrarily complex action spaces by planning over sampled actions. We demonstrate this approach on the classical board game of Go and on two continuous control benchmark domains: DeepMind Control Suite and Real-World RL Suite.
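The idea of evaluating and improving a policy over a sampled action subset can be illustrated with a small toy sketch. This is not the authors' Sampled MuZero implementation: there is no learned model or tree search here, and the names `toy_q`, `sample_actions`, `improved_weights` and `policy_iteration_step`, the 1-D Gaussian policy, and the softmax temperature are all assumptions made for illustration. The sketch draws K actions from the current policy, scores only those K actions, and moves the policy mean toward the better ones.

```python
import math
import random


def toy_q(action):
    """Hypothetical action-value function with its maximum at action = 0.5."""
    return -(action - 0.5) ** 2


def sample_actions(mean, std, k, rng):
    """Draw k candidate actions from a 1-D Gaussian policy."""
    return [rng.gauss(mean, std) for _ in range(k)]


def improved_weights(actions, q_fn, temperature):
    """Softmax over the Q-values of the sampled actions only.

    No explicit prior term appears because the actions were drawn from
    the policy itself, so each sample already stands in for its own
    probability mass (an assumption of this toy setup).
    """
    qs = [q_fn(a) for a in actions]
    q_max = max(qs)  # subtract the max for numerical stability
    exps = [math.exp((q - q_max) / temperature) for q in qs]
    total = sum(exps)
    return [e / total for e in exps]


def policy_iteration_step(mean, std, k, rng, temperature=0.05):
    """One sample-based evaluate-and-improve step on the policy mean."""
    actions = sample_actions(mean, std, k, rng)
    weights = improved_weights(actions, toy_q, temperature)
    new_mean = sum(w * a for w, a in zip(weights, actions))
    return new_mean, actions, weights


rng = random.Random(0)
mean = -1.0  # start far from the best action
for _ in range(20):
    mean, actions, weights = policy_iteration_step(mean, 0.5, 16, rng)
print(mean)
```

Only 16 actions are ever scored per step, yet the policy mean drifts toward the region where `toy_q` is highest, which is the intuition behind improving a policy from a sampled subset rather than from a full enumeration of the action space.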

Benchmark Results

Dataset                   Model     Metric   Claimed   Verified   Status
acrobot.swingup           SMuZero   Return   417.52    -          Unverified
ball_in_cup.catch         SMuZero   Return   977.38    -          Unverified
cartpole.balance          SMuZero   Return   984.86    -          Unverified
cartpole.balance_sparse   SMuZero   Return   998.14    -          Unverified
cartpole.swingup          SMuZero   Return   868.87    -          Unverified
cartpole.swingup_sparse   SMuZero   Return   846.91    -          Unverified
cheetah.run               SMuZero   Return   914.39    -          Unverified
finger.spin               SMuZero   Return   986.38    -          Unverified
finger.turn_easy          SMuZero   Return   972.53    -          Unverified
finger.turn_hard          SMuZero   Return   963.07    -          Unverified
hopper.hop                SMuZero   Return   528.24    -          Unverified
hopper.stand              SMuZero   Return   926.5     -          Unverified
pendulum.swingup          SMuZero   Return   837.76    -          Unverified
quadruped.run             SMuZero   Return   923.54    -          Unverified
quadruped.walk            SMuZero   Return   933.77    -          Unverified
reacher.easy              SMuZero   Return   982.26    -          Unverified
reacher.hard              SMuZero   Return   971.53    -          Unverified
walker.run                SMuZero   Return   931.06    -          Unverified
walker.stand              SMuZero   Return   987.79    -          Unverified
walker.walk               SMuZero   Return   975.46    -          Unverified