Uncertainty-sensitive learning and planning with ensembles
Piotr Miłoś, Łukasz Kuciński, Konrad Czechowski, Piotr Kozakowski, Maciej Klimek
- Code (official): github.com/learningandplanningICLR/learningandplanning
Abstract
We propose a reinforcement learning framework for discrete environments in which an agent optimizes its behavior on two timescales. On the short timescale, it uses tree search methods to make tactical decisions. The long, strategic timescale is handled by an ensemble of value functions learned using TD-like backups. Combining these two techniques brings synergies. The planning module performs what-if analysis, which helps the agent avoid short-term pitfalls and improves the backups of the value function. Notably, our method performs well in environments with sparse rewards where standard TD(1) backups fail. Conversely, the value functions compensate for the inherent short-sightedness of planning. Importantly, we use ensembles to measure the epistemic uncertainty of the value functions. This serves two purposes: a) it stabilizes planning, and b) it guides exploration. We evaluate our method on discrete environments with sparse rewards: the Deep sea chain environment, toy Montezuma's Revenge, and Sokoban. In all cases, we obtain faster learning and better final performance.
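As a rough illustration of how ensemble disagreement can serve as an uncertainty signal, the sketch below scores each action by the mean of its ensemble value estimates plus a multiple of their standard deviation. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the scoring rule, so the mean-plus-`kappa`-times-standard-deviation form, the function names `ensemble_score` and `pick_action`, the parameter `kappa`, and the example values are all hypothetical.

```python
from statistics import mean, pstdev


def ensemble_score(values, kappa=1.0):
    """Combine ensemble value estimates into one score.

    Assumed rule (not taken from the paper): mean + kappa * std.
    The standard deviation across ensemble members stands in for
    epistemic uncertainty: it is large where members disagree.
    """
    return mean(values) + kappa * pstdev(values)


def pick_action(action_values, kappa=1.0):
    """Return the action with the highest ensemble score."""
    return max(action_values, key=lambda a: ensemble_score(action_values[a], kappa))


# Hypothetical value estimates from a 4-member ensemble for two actions.
action_values = {
    "left": [0.5, 0.5, 0.5, 0.5],   # members agree: low uncertainty
    "right": [0.0, 0.0, 1.0, 1.0],  # members disagree: high uncertainty
}

# A positive kappa favors uncertain actions (exploration);
# a negative kappa favors actions the ensemble agrees on (caution).
explore_choice = pick_action(action_values, kappa=1.0)
cautious_choice = pick_action(action_values, kappa=-1.0)
```

Both actions have the same mean estimate here, so the choice depends only on the sign of `kappa`: the disagreement term breaks the tie toward the uncertain action when exploring and away from it when acting cautiously.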