SOTAVerified

Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

2022-10-15Unverified0· sign in to hype

Zihan Zhang, Yuhang Jiang, Yuan Zhou, Xiangyang Ji

Unverified — Be the first to reproduce this paper.

Reproduce

Abstract

In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with constraint on the number of batches. The multi-batch reinforcement learning framework, where the agent is required to provide a time schedule to update policy before everything, which is particularly suitable for the scenarios where the agent suffers extensively from changing the policy adaptively. Given a finite-horizon MDP with S states, A actions and planning horizon H, we design a computational efficient algorithm to achieve near-optimal regret of O(SAH^3K(1/))O() hides logarithmic terms of (S,A,H,K) in K episodes using O(H+_2_2(K) ) batches with confidence parameter . To our best of knowledge, it is the first O(SAH^3K) regret bound with O(H+_2_2(K)) batch complexity. Meanwhile, we show that to achieve O(poly(S,A,H)K) regret, the number of batches is at least (H/_A(K)+ _2_2(K) ), which matches our upper bound up to logarithmic terms. Our technical contribution are two-fold: 1) a near-optimal design scheme to explore over the unlearned states; 2) an computational efficient algorithm to explore certain directions with an approximated transition model.

Tasks

Reproductions