Minimax-Optimal Reward-Agnostic Exploration in Reinforcement Learning

2023-04-14

Gen Li, Yuling Yan, Yuxin Chen, Jianqing Fan

Abstract

This paper studies reward-agnostic exploration in reinforcement learning (RL) -- a scenario where the learner is unaware of the reward functions during the exploration stage -- and designs an algorithm that improves over the state of the art. More precisely, consider a finite-horizon inhomogeneous Markov decision process with $S$ states, $A$ actions, and horizon length $H$, and suppose that there are no more than a polynomial number of given reward functions of interest. By collecting an order of $\frac{SAH^3}{\varepsilon^2}$ sample episodes (up to log factor) without guidance of the reward information, our algorithm is able to find $\varepsilon$-optimal policies for all these reward functions, provided that $\varepsilon$ is sufficiently small. This forms the first reward-agnostic exploration scheme in this context that achieves provable minimax optimality. Furthermore, once the sample size exceeds $\frac{S^2AH^3}{\varepsilon^2}$ episodes (up to log factor), our algorithm is able to yield $\varepsilon$ accuracy for arbitrarily many reward functions (even when they are adversarially designed), a task commonly dubbed as "reward-free exploration." The novelty of our algorithm design draws on insights from offline RL: the exploration scheme attempts to maximize a critical reward-agnostic quantity that dictates the performance of offline RL, while the policy learning paradigm leverages ideas from sample-optimal offline RL paradigms.
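To make the gap between the two sample-complexity regimes concrete, here is a minimal sketch comparing the two bounds from the abstract, with log factors omitted. The particular values of S, A, H, and eps are hypothetical, chosen purely for illustration; the point is the extra factor of S that reward-free exploration incurs over the reward-agnostic setting.

```python
# Illustrative comparison of the two sample-complexity orders stated in the
# abstract. Log factors are omitted, and the inputs below are hypothetical.

def reward_agnostic_episodes(S: int, A: int, H: int, eps: float) -> float:
    """Order of episodes to find eps-optimal policies for polynomially many
    pre-specified reward functions: S * A * H^3 / eps^2."""
    return S * A * H**3 / eps**2

def reward_free_episodes(S: int, A: int, H: int, eps: float) -> float:
    """Order of episodes to yield eps accuracy for arbitrarily many (even
    adversarially designed) reward functions: S^2 * A * H^3 / eps^2."""
    return S**2 * A * H**3 / eps**2

if __name__ == "__main__":
    S, A, H, eps = 100, 10, 20, 0.1  # hypothetical problem size
    n_agnostic = reward_agnostic_episodes(S, A, H, eps)
    n_free = reward_free_episodes(S, A, H, eps)
    print(f"reward-agnostic: ~{n_agnostic:.2e} episodes")
    print(f"reward-free:     ~{n_free:.2e} episodes")
    print(f"extra factor (equals S): {n_free / n_agnostic:.0f}")
```

With these illustrative numbers the reward-free requirement exceeds the reward-agnostic one by exactly a factor of S = 100, which is the savings the paper's minimax-optimal scheme achieves when only polynomially many reward functions are of interest.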
