Understanding & Generalizing AlphaGo Zero

2019-05-01 · ICLR 2019

Ravichandra Addanki, Mohammad Alizadeh, Shaileshh Bojja Venkatakrishnan, Devavrat Shah, Qiaomin Xie, Zhi Xu

Abstract

AlphaGo Zero (AGZ) introduced a new tabula rasa reinforcement learning algorithm that has achieved superhuman performance in the games of Go, Chess, and Shogi with no prior knowledge other than the rules of the game. This success naturally raises the question of whether it is possible to develop similar high-performance reinforcement learning algorithms for generic sequential decision-making problems (beyond two-player games), using only the constraints of the environment as the "rules." To address this challenge, we start by taking steps towards developing a formal understanding of AGZ. AGZ includes two key innovations: (1) it learns a policy (represented as a neural network) using supervised learning with cross-entropy loss from samples generated via Monte-Carlo Tree Search (MCTS); (2) it uses self-play to learn without training data. We argue that the self-play in AGZ corresponds to learning a Nash equilibrium for the two-player game; and the supervised learning with MCTS is attempting to learn the policy corresponding to the Nash equilibrium, by establishing a novel bound on the difference between the expected return achieved by two policies in terms of the expected KL divergence (cross-entropy) of their induced distributions. To extend AGZ to generic sequential decision-making problems, we introduce a robust MDP framework, in which the agent and nature effectively play a zero-sum game: the agent aims to take actions to maximize reward while nature seeks state transitions, subject to the constraints of that environment, that minimize the agent's reward. For a challenging network scheduling domain, we find that AGZ within the robust MDP framework provides near-optimal performance, matching one of the best known scheduling policies that has taken the networking community three decades of intensive research to develop.
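The agent-versus-nature game in the robust MDP framework can be illustrated with a small tabular sketch. This is not the paper's method: AGZ uses MCTS and a neural network, whereas the code below swaps in plain value iteration to show only the max-min structure, in which the agent maximizes over actions and nature minimizes over the transitions the environment allows. The function names, the `TOY_ALLOWED` table, and the discount factor are all hypothetical, chosen for illustration.

```python
from typing import Dict, List, Tuple

State = int
Action = str
# allowed[(state, action)] lists the (next_state, reward) outcomes that
# nature may pick from; this encodes the environment's constraints.
Allowed = Dict[Tuple[State, Action], List[Tuple[State, float]]]


def robust_value_iteration(
    states: List[State],
    actions: List[Action],
    allowed: Allowed,
    gamma: float = 0.9,
    tol: float = 1e-9,
    max_iters: int = 10_000,
) -> Dict[State, float]:
    """Iterate V(s) = max_a min_{(s', r)} [r + gamma * V(s')] to a fixed point."""
    values = {s: 0.0 for s in states}
    for _ in range(max_iters):
        new_values = {
            s: max(
                # Nature picks the worst allowed outcome for the agent.
                min(r + gamma * values[s2] for s2, r in allowed[(s, a)])
                for a in actions
            )
            for s in states
        }
        delta = max(abs(new_values[s] - values[s]) for s in states)
        values = new_values
        if delta < tol:
            break
    return values


def robust_greedy_policy(
    states: List[State],
    actions: List[Action],
    allowed: Allowed,
    values: Dict[State, float],
    gamma: float = 0.9,
) -> Dict[State, Action]:
    """Pick, in each state, the action with the best worst-case return."""
    return {
        s: max(
            actions,
            key=lambda a: min(r + gamma * values[s2] for s2, r in allowed[(s, a)]),
        )
        for s in states
    }


# Hypothetical two-state example: "risky" offers a higher reward, but
# nature can answer it with a bad transition.
TOY_STATES = [0, 1]
TOY_ACTIONS = ["safe", "risky"]
TOY_ALLOWED: Allowed = {
    (0, "safe"): [(0, 1.0)],
    (0, "risky"): [(0, 2.0), (1, 0.0)],
    (1, "safe"): [(0, 0.0)],
    (1, "risky"): [(1, 0.0), (0, 0.5)],
}

if __name__ == "__main__":
    v = robust_value_iteration(TOY_STATES, TOY_ACTIONS, TOY_ALLOWED)
    print(v)
    print(robust_greedy_policy(TOY_STATES, TOY_ACTIONS, TOY_ALLOWED, v))
```

In this toy instance the worst-case reasoning makes the agent prefer "safe" in both states, since nature can always counter "risky" with the least favorable allowed transition.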
