
Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs

2026-03-25

Guy Zamir, Matthew Zurek, Yudong Chen


Abstract

Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less developed, both theoretically and algorithmically, than its episodic counterpart, with many algorithms suffering from high "burn-in" costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the γ-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form O(√(SA · Var) + lower-order terms), where S and A are the state and action space sizes and Var captures the cumulative transition variance. This yields minimax-optimal average-reward and γ-regret bounds in the worst case while also adapting to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, our algorithm enjoys significantly improved lower-order terms in the average-reward setting. With prior knowledge of the optimal bias span h*_sp, our algorithm obtains lower-order terms scaling as h*_sp S^2 A, which we prove is optimal in both h*_sp and A. Without prior knowledge, we prove that no algorithm can have lower-order terms smaller than (h*_sp)^2 SA, and we provide a prior-free algorithm whose lower-order terms scale as (h*_sp)^2 S^3 A, nearly matching this lower bound. Taken together, these results completely characterize the optimal dependence on h*_sp in both the leading and lower-order terms, and reveal a fundamental gap between what is achievable with and without prior knowledge.
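For easier scanning, the guarantees stated above can be collected into one display. This is a sketch based only on the abstract: the square root in the leading term is our reading of the standard variance-dependent form, and Reg is a label we introduce here for cumulative regret; all other symbols are the abstract's.

% Summary of the stated bounds (Reg = cumulative regret, a label
% introduced here; S, A, Var, h*_sp are as defined in the abstract).

% Leading term, shared by the average-reward and γ-regret settings:
\[
  \mathrm{Reg} \;=\; O\!\left(\sqrt{SA \cdot \mathrm{Var}}\right) \;+\; \text{lower-order terms}.
\]

% Lower-order terms in the average-reward setting:
\begin{align*}
  &\text{known } h^{\star}_{\mathrm{sp}}:      && O\!\left(h^{\star}_{\mathrm{sp}}\, S^{2} A\right)
      && \text{(optimal in } h^{\star}_{\mathrm{sp}} \text{ and } A\text{)} \\
  &\text{prior-free lower bound}:              && \Omega\!\left((h^{\star}_{\mathrm{sp}})^{2}\, S A\right) \\
  &\text{prior-free algorithm}:                && O\!\left((h^{\star}_{\mathrm{sp}})^{2}\, S^{3} A\right)
\end{align*}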
