Improved Algorithms for Misspecified Linear Markov Decision Processes
Daniel Vial, Advait Parulekar, Sanjay Shakkottai, R. Srikant
Unverified — Be the first to reproduce this paper.
ReproduceAbstract
For the misspecified linear Markov decision process (MLMDP) model of Jin et al. [2020], we propose an algorithm with three desirable properties. (P1) Its regret after K episodes scales as K \ _mis, _tol \, where _mis is the degree of misspecification and _tol is a user-specified error tolerance. (P2) Its space and per-episode time complexities remain bounded as K . (P3) It does not require _mis as input. To our knowledge, this is the first algorithm satisfying all three properties. For concrete choices of _tol, we also improve existing regret bounds (up to log factors) while achieving either (P2) or (P3) (existing algorithms satisfy neither). At a high level, our algorithm generalizes (to MLMDPs) and refines the Sup-Lin-UCB algorithm, which Takemura et al. [2021] recently showed satisfies (P3) for contextual bandits. We also provide an intuitive interpretation of their result, which informs the design of our algorithm.