Efficient Reinforcement Learning in Probabilistic Reward Machines
Xiaofeng Lin, Xuezhou Zhang
Abstract
In this paper, we study reinforcement learning in Markov Decision Processes with Probabilistic Reward Machines (PRMs), a form of non-Markovian reward commonly found in robotics tasks. We design an algorithm for PRMs that achieves a regret bound of $\widetilde{O}(\sqrt{HOAT} + H^2O^2A^{3/2} + H\sqrt{T})$, where $H$ is the time horizon, $O$ is the number of observations, $A$ is the number of actions, and $T$ is the number of time-steps. This result improves over the best-known bound, $\widetilde{O}(H\sqrt{OAT})$, of Bourel et al. (2023) for MDPs with Deterministic Reward Machines (DRMs), a special case of PRMs. When $T \geq H^3O^3A^2$ and $OA \geq H$, our regret bound leads to a regret of $\widetilde{O}(\sqrt{HOAT})$, which matches the established lower bound of $\Omega(\sqrt{HOAT})$ for MDPs with DRMs up to a logarithmic factor. To the best of our knowledge, this is the first efficient algorithm for PRMs. Additionally, we present a new simulation lemma for non-Markovian rewards, which enables reward-free exploration for any non-Markovian reward given access to an approximate planner. Complementing our theoretical findings, we show through extensive experimental evaluations that our algorithm indeed outperforms prior methods in various PRM environments.
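The two side conditions in the abstract are exactly the thresholds at which the leading $\sqrt{HOAT}$ term dominates the other two terms of the bound. A quick check, using only the bound as stated:

```latex
% When does \sqrt{HOAT} dominate in
% \widetilde{O}(\sqrt{HOAT} + H^2 O^2 A^{3/2} + H\sqrt{T})?
\begin{align*}
H^2 O^2 A^{3/2} \le \sqrt{HOAT}
  &\iff H^4 O^4 A^3 \le HOAT
   \iff T \ge H^3 O^3 A^2, \\
H\sqrt{T} \le \sqrt{HOAT}
  &\iff H^2 T \le HOAT
   \iff OA \ge H.
\end{align*}
% Under both conditions the bound collapses to \widetilde{O}(\sqrt{HOAT}),
% matching the \Omega(\sqrt{HOAT}) lower bound up to log factors.
```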
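For readers unfamiliar with reward machines, the sketch below illustrates one plausible encoding of a PRM: a finite automaton whose internal state advances stochastically on each environment observation and which emits the (non-Markovian) reward. The class name, fields, and tuple signatures here are illustrative assumptions, not the paper's formal definition.

```python
import random

class ProbabilisticRewardMachine:
    """Minimal illustrative sketch of a PRM (names/structure assumed).

    A PRM augments an MDP with a finite set of machine states. On each
    observation, the machine state transitions *stochastically* (unlike a
    DRM, where the transition is deterministic), and the reward depends on
    the machine transition, so it is non-Markovian in the observation alone.
    """

    def __init__(self, states, init_state, transition, reward):
        self.states = states          # finite set of machine states U
        self.state = init_state       # current machine state u
        self.transition = transition  # (u, obs) -> dict {u_next: prob}
        self.reward = reward          # (u, obs, u_next) -> scalar reward

    def step(self, obs):
        """Advance the machine on observation `obs`; return the reward."""
        dist = self.transition[(self.state, obs)]
        u_next = random.choices(list(dist.keys()),
                                weights=list(dist.values()))[0]
        r = self.reward[(self.state, obs, u_next)]
        self.state = u_next
        return r

# Example: from u0, observing "goal" reaches the rewarding state u1 only
# with probability 0.8 -- a probabilistic transition a DRM cannot express.
prm = ProbabilisticRewardMachine(
    states={"u0", "u1"},
    init_state="u0",
    transition={
        ("u0", "goal"): {"u1": 0.8, "u0": 0.2},
        ("u0", "other"): {"u0": 1.0},
        ("u1", "goal"): {"u1": 1.0},
        ("u1", "other"): {"u1": 1.0},
    },
    reward={
        ("u0", "goal", "u1"): 1.0,
        ("u0", "goal", "u0"): 0.0,
        ("u0", "other", "u0"): 0.0,
        ("u1", "goal", "u1"): 0.0,
        ("u1", "other", "u1"): 0.0,
    },
)
print(prm.step("goal"))  # 1.0 with probability 0.8, else 0.0
```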