RL^3: Boosting Meta Reinforcement Learning via RL inside RL^2
Abhinav Bhatia, Samer B. Nashed, Shlomo Zilberstein
Code: github.com/bhatiaabhinav/rl3
Abstract
Meta reinforcement learning (meta-RL) methods such as RL^2 have emerged as promising approaches for learning data-efficient RL algorithms tailored to a given task distribution. However, they show poor asymptotic performance and struggle with out-of-distribution tasks because they rely on sequence models, such as recurrent neural networks or transformers, to process experiences rather than summarize them using general-purpose RL components such as value functions. In contrast, traditional RL algorithms are data-inefficient, as they do not use domain knowledge, but do converge to an optimal policy in the limit. We propose RL^3, a principled hybrid approach that incorporates action-values, learned per task via traditional RL, into the inputs to meta-RL. We show that RL^3 earns greater cumulative reward in the long term than RL^2, drastically reduces meta-training time, and generalizes better to out-of-distribution tasks. Experiments are conducted on both custom and benchmark discrete domains from the meta-RL literature that exhibit a range of short-term, long-term, and complex dependencies.
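To make the core idea concrete: the abstract describes feeding the meta-learner per-task action-value estimates, maintained by a traditional RL procedure, alongside the usual RL^2 inputs (state, previous action, reward, done flag). Below is a minimal sketch of that input augmentation, assuming a discrete tabular setting; the names `TaskQEstimator` and `rl3_input` and all hyperparameters are illustrative placeholders, not the authors' implementation.

```python
import numpy as np

class TaskQEstimator:
    """Tabular Q-learning run online within the current task.

    Maintains action-value estimates that are fed to the meta-RL
    sequence model in addition to the raw transition tuple.
    """

    def __init__(self, n_states, n_actions, lr=0.1, gamma=0.99):
        self.q = np.zeros((n_states, n_actions))
        self.lr = lr
        self.gamma = gamma

    def update(self, s, a, r, s_next, done):
        # Standard one-step Q-learning backup.
        target = r if done else r + self.gamma * self.q[s_next].max()
        self.q[s, a] += self.lr * (target - self.q[s, a])

    def values(self, s):
        # Current action-value estimates for state s.
        return self.q[s].copy()


def rl3_input(s, prev_a, r, done, q_estimator, n_states, n_actions):
    """Build one timestep of input for the meta-RL sequence model.

    RL^2 would use only the (state, prev_action, reward, done) part;
    the RL^3 idea, per the abstract, is to append the task-specific
    action-value estimates for the current state.
    """
    s_onehot = np.eye(n_states)[s]
    a_onehot = np.eye(n_actions)[prev_a]
    return np.concatenate(
        [s_onehot, a_onehot, [r, float(done)], q_estimator.values(s)]
    )
```

One way to read why this helps, consistent with the abstract's claims: at meta-test time the Q-estimates keep improving through ordinary Q-learning updates even on tasks outside the training distribution, so the meta-learner receives a summary statistic that converges toward optimal values in the limit rather than relying solely on what the sequence model memorized during meta-training.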