Population-Guided Imitation Learning
Anonymous
Abstract
Learning to imitate expert behavior is a challenging problem, especially in environments with high-dimensional, continuous observations and unknown dynamics. The problem includes imitation learning from demonstrations (ILfD) and imitation learning from observations (ILfO). The simplest methods are behavior cloning (BC) for ILfD and behavior cloning from observations (BCO) for ILfO. However, BC suffers from distribution shift, while in ILfO the inverse dynamics model depends heavily on the current policy and does not generalize well to the expert state distribution. Since no easily specified reward function is available, exploration is more important in imitation learning than in regular RL. In this paper, we propose population-based exploration techniques for imitation learning that are simple to implement and significantly improve sample efficiency; we find that population-based exploration yields much larger performance gains in imitation learning than it does in regular RL problems. In ILfD, to enlarge the overall search region, we use Stein Variational Gradient Descent (SVGD) to generate multiple policies, and we attenuate distribution shift with RL on intrinsic rewards. In ILfO, to additionally produce more diverse state-action pairs so that the inverse dynamics model generalizes better, we introduce neuro-evolution (NE) to further augment the exploration capability of the learned policies. The intrinsic rewards are generated simply by random network distillation (RND) trained over expert states. The proposed frameworks give the imitation agent both the intrinsic intention of the demonstrator and better exploration ability, which is critical for the agent to outperform the demonstrator.
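The RND-on-expert-states idea can be sketched concretely. The minimal illustration below is not the paper's implementation: a predictor (frozen random tanh features plus a linear head, fit by least squares) is trained to match a fixed random target network on expert states only, so its prediction error is low near the expert state distribution and high on states far from it; negating the error then rewards expert-like states. All network shapes, the least-squares fit, and the reward sign convention are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, N_FEATURES, N_EXPERT = 4, 300, 256

# Fixed, randomly initialized target network (never trained), as in RND.
W_t = rng.normal(size=(STATE_DIM, 16))
v_t = rng.normal(size=16)

def target(states):
    return np.tanh(states @ W_t) @ v_t

# Predictor: frozen random tanh features plus a trainable linear head.
W_f = rng.normal(size=(STATE_DIM, N_FEATURES))

def features(states):
    return np.tanh(states @ W_f)

# "Expert" states; the predictor head is fit to the target on them only.
expert_states = rng.normal(size=(N_EXPERT, STATE_DIM))
head, *_ = np.linalg.lstsq(features(expert_states),
                           target(expert_states), rcond=None)

def prediction_error(states):
    return (features(states) @ head - target(states)) ** 2

# Intrinsic reward: low prediction error near the expert state
# distribution, high error elsewhere (the sign here is an assumption).
def intrinsic_reward(states):
    return -prediction_error(states)

novel_states = rng.normal(size=(32, STATE_DIM)) + 5.0  # far from expert data
print(intrinsic_reward(expert_states).mean())  # close to 0
print(intrinsic_reward(novel_states).mean())   # much more negative
```

Because the head is trained only where expert states lie, the error surface itself encodes "how expert-like is this state", which is what lets the intrinsic reward stand in for the missing reward function.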
In experiments on ILfD and ILfO across various difficult Atari games and MuJoCo environments, the proposed exploration-augmented methods show significant performance improvements, notably achieving 5X better sample efficiency than previous popular imitation learning methods.
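The SVGD population update mentioned above can be illustrated on a toy problem. The sketch below (an illustration under stated assumptions, not the paper's method) runs the standard SVGD update with an RBF kernel and the median-heuristic bandwidth: the kernel-weighted score term pulls the particle population toward a target log-density while the kernel-gradient term repels particles from each other, keeping the population diverse. In the paper each particle would be one policy's parameters and the score would come from the RL objective; here a 2-D standard Gaussian stands in for that signal.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the policy-improvement signal: the score (gradient of
# the log-density) of a 2-D standard Gaussian.
def grad_log_p(x):
    return -x

def svgd_step(particles, step=0.2):
    n = len(particles)
    # Pairwise differences: diff[j, i] = x_j - x_i, shape (n, n, d).
    diff = particles[:, None, :] - particles[None, :, :]
    sq = np.sum(diff ** 2, axis=-1)
    # Median heuristic for the RBF kernel bandwidth.
    h = np.median(sq) / np.log(n + 1) + 1e-8
    k = np.exp(-sq / h)                            # kernel matrix, (n, n)
    # d k(x_j, x_i) / d x_j; summed over j this is the repulsive term.
    grad_k = -2.0 / h * (k[..., None] * diff)      # (n, n, d)
    # SVGD direction: attraction toward high density + mutual repulsion.
    phi = (k @ grad_log_p(particles) + grad_k.sum(axis=0)) / n
    return particles + step * phi

particles = rng.normal(size=(20, 2)) + 2.0  # population starts off the mode
for _ in range(1000):
    particles = svgd_step(particles)

print(np.abs(particles.mean(axis=0)))  # population centered near the mode
print(particles.std())                 # repulsion keeps the particles spread
```

The repulsive term is what distinguishes this from running several independent gradient ascents: without it the particles would all collapse onto the same optimum, which is exactly the loss of diversity the population-based exploration is meant to avoid.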