Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments
Liyu Chen, Haipeng Luo
Unverified — Be the first to reproduce this paper.
ReproduceAbstract
We initiate the study of dynamic regret minimization for goal-oriented reinforcement learning modeled by a non-stationary stochastic shortest path problem with changing cost and transition functions. We start by establishing a lower bound ((B_ SAT_(_c + B_^2_P))^1/3K^2/3), where B_ is the maximum expected cost of the optimal policy of any episode starting from any state, T_ is the maximum hitting time of the optimal policy of any episode starting from the initial state, SA is the number of state-action pairs, _c and _P are the amount of changes of the cost and transition functions respectively, and K is the number of episodes. The different roles of _c and _P in this lower bound inspire us to design algorithms that estimate costs and transitions separately. Specifically, assuming the knowledge of _c and _P, we develop a simple but sub-optimal algorithm and another more involved minimax optimal algorithm (up to logarithmic terms). These algorithms combine the ideas of finite-horizon approximation [Chen et al., 2022a], special Bernstein-style bonuses of the MVP algorithm [Zhang et al., 2020], adaptive confidence widening [Wei and Luo, 2021], as well as some new techniques such as properly penalizing long-horizon policies. Finally, when _c and _P are unknown, we develop a variant of the MASTER algorithm [Wei and Luo, 2021] and integrate the aforementioned ideas into it to achieve O( _ SALK, (B_^2S^2AT_(_c+B__P))^1/3K^2/3\) regret, where L is the unknown number of changes of the environment.