Off-policy Learning with Eligibility Traces: A Survey
Matthieu Geist, Bruno Scherrer
Abstract
In the framework of Markov Decision Processes, we consider off-policy learning, that is, the problem of learning a linear approximation of the value function of some fixed policy from one trajectory possibly generated by some other policy. We briefly review on-policy learning algorithms of the literature (gradient-based and least-squares-based), adopting a unified algorithmic view. Then, we highlight a systematic approach for adapting them to off-policy learning with eligibility traces. This leads to some known algorithms - off-policy LSTD(λ), LSPE(λ), TD(λ), TDC/GQ(λ) - and suggests new extensions - off-policy FPKF(λ), BRM(λ), gBRM(λ), GTD2(λ). We describe a comprehensive algorithmic derivation of all algorithms in a recursive and memory-efficient form, discuss their known convergence properties, and illustrate their relative empirical behavior on Garnet problems. Our experiments suggest that the most standard algorithms perform best: on- and off-policy LSTD(λ)/LSPE(λ), and TD(λ) when the feature-space dimension is too large for a least-squares approach.
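As a rough illustration of the setting surveyed here (not the paper's own derivation), the sketch below runs a standard off-policy TD(λ) update with linear function approximation: a behavior policy generates one trajectory, and per-step importance ratios ρ_t = π(a_t|s_t)/μ(a_t|s_t) reweight the eligibility trace so that the value function of the fixed target policy π is estimated. All problem sizes, policies, and constants are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, dim = 5, 2, 3
gamma, lam, alpha = 0.9, 0.6, 0.02  # illustrative constants

# Fixed linear features, random transition kernel and rewards (a toy MDP).
phi = rng.standard_normal((n_states, dim))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.standard_normal((n_states, n_actions))

# Behavior policy mu generates the trajectory; target policy pi is evaluated.
mu = np.full((n_states, n_actions), 1.0 / n_actions)
pi = np.array([[0.7, 0.3], [0.3, 0.7], [0.6, 0.4], [0.4, 0.6], [0.7, 0.3]])

theta = np.zeros(dim)  # weights of the linear value approximation
z = np.zeros(dim)      # eligibility trace
s = 0
for t in range(10000):
    a = rng.choice(n_actions, p=mu[s])
    s2 = rng.choice(n_states, p=P[s, a])
    rho = pi[s, a] / mu[s, a]                              # importance ratio
    delta = R[s, a] + gamma * phi[s2] @ theta - phi[s] @ theta  # TD error
    z = rho * (gamma * lam * z + phi[s])                   # off-policy trace
    theta += alpha * delta * z
    s = s2

print(theta)  # learned weight vector
```

TD(λ) is the cheapest of the reviewed family (O(dim) per step); the least-squares variants such as LSTD(λ) maintain O(dim²) statistics instead, which is why the abstract distinguishes them by feature-space dimension.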