An Adiabatic Theorem for Policy Tracking with TD-learning
2020-10-24
Neil Walton
Abstract
We evaluate the ability of temporal difference learning to track the reward function of a policy as it changes over time. Our results apply a new adiabatic theorem that bounds the mixing time of time-inhomogeneous Markov chains. We derive finite-time bounds for tabular temporal difference learning and Q-learning when the policy used for training changes in time. To achieve this, we develop bounds for stochastic approximation under asynchronous adiabatic updates.
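To make the setting concrete, here is a minimal sketch (not the paper's algorithm or bounds) of tabular TD(0) tracking the value function of a slowly drifting policy on a toy 2-state MDP. All quantities below — the rewards, transition model, step size, and drift rate — are illustrative assumptions; the point is only that the policy changes slowly ("adiabatically") relative to the learning updates.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states = 2
gamma = 0.9      # discount factor
alpha = 0.05     # TD step size
drift = 4e-6     # per-step policy change: slow relative to alpha (adiabatic regime)

# Illustrative rewards R[state, action]; two actions per state.
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])

# Policy: probability of action 0 in each state; drifts slowly over time.
p = np.array([0.9, 0.9])

V = np.zeros(n_states)  # tabular TD value estimates
s = 0
for t in range(200_000):
    a = 0 if rng.random() < p[s] else 1
    r = R[s, a]
    s_next = int(rng.integers(n_states))   # uniform transitions, for simplicity
    V[s] += alpha * (r + gamma * V[s_next] - V[s])  # TD(0) update
    s = s_next
    p = np.clip(p - drift, 0.1, 0.9)       # adiabatic policy drift

print(V)  # estimates track the value function of the current (drifted) policy
```

Because the drift rate is small compared with the step size, the TD iterates stay close to the value function of the instantaneous policy; the paper's contribution is to quantify this gap with finite-time bounds via an adiabatic theorem for time-inhomogeneous Markov chains.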