Optimistic Q-learning for average reward and episodic reinforcement learning
Priyank Agrawal, Shipra Agrawal
Unverified — Be the first to reproduce this paper.
ReproduceAbstract
We present an optimistic Q-learning algorithm for regret minimization in average reward reinforcement learning under an additional assumption on the underlying MDP that for all policies, the time to visit some frequent state s_0 is finite and upper bounded by H, either in expectation or with constant probability. Our setting strictly generalizes the episodic setting and is significantly less restrictive than the assumption of bounded hitting time for all states made by most previous literature on model-free algorithms in average reward settings. We demonstrate a regret bound of O(H^5 SAT), where S and A are the numbers of states and actions, and T is the horizon. A key technical novelty of our work is the introduction of an L operator defined as L v = 1H _h=1^H L^h v where L denotes the Bellman operator. Under the given assumption, we show that the L operator has a strict contraction (in span) even in the average-reward setting where the discount factor is 1. Our algorithm design uses ideas from episodic Q-learning to estimate and apply this operator iteratively. Thus, we provide a unified view of regret minimization in episodic and non-episodic settings, which may be of independent interest.