
Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

2025-05-30

Oliver Mortensen, Mohammad Sadegh Talebi


Abstract

In this paper we analyze the sample complexities of learning the optimal state-action value function $Q^*$ and an optimal policy $\pi^*$ in a discounted Markov decision process (MDP) where the agent has recursive entropic risk-preferences with risk-parameter $\beta \neq 0$ and where a generative model of the MDP is available. We provide and analyze a simple model-based approach which we call model-based risk-sensitive Q-value-iteration (MB-RS-QVI), which leads to $(\varepsilon,\delta)$-PAC bounds on $\|Q^*-Q_k\|$ and $\|V^*-V^{\pi_k}\|$, where $Q_k$ is the output of MB-RS-QVI after $k$ iterations and $\pi_k$ is the greedy policy with respect to $Q_k$. Both PAC bounds have exponential dependence on the effective horizon $\frac{1}{1-\gamma}$, and the strength of this dependence grows with the learner's risk-sensitivity $|\beta|$. We also provide two lower bounds which show that exponential dependence on $\frac{|\beta|}{1-\gamma}$ is unavoidable in both cases. The lower bounds reveal that the PAC bounds are tight in $\varepsilon$ and $\delta$, that the PAC bound on Q-learning is tight in the number of actions $A$, and that the PAC bound on policy-learning is nearly tight in $A$.
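The abstract describes MB-RS-QVI only at a high level. Below is a minimal sketch of what such a procedure could look like, assuming the recursive entropic Bellman backup $Q(s,a) = r(s,a) + \frac{\gamma}{\beta}\log \mathbb{E}_{s' \sim \hat P(\cdot\mid s,a)}\big[\exp\big(\beta \max_{a'} Q(s',a')\big)\big]$ applied to an empirical model $\hat P$ built from $N$ generative samples per state-action pair. The function name, input format, known-reward assumption, and the exact placement of $\gamma$ relative to the risk operator are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def mb_rs_qvi(next_state_samples, rewards, beta, gamma, num_iters):
    """Illustrative sketch of model-based risk-sensitive Q-value iteration.

    next_state_samples: int array (S, A, N) of next-state indices drawn from
        the generative model, N per state-action pair (assumed input format).
    rewards: float array (S, A) of known deterministic rewards (assumption).
    beta:  nonzero risk parameter (beta < 0 risk-averse, beta > 0 risk-seeking).
    gamma: discount factor in (0, 1).
    """
    S, A, N = next_state_samples.shape

    # Step 1: empirical transition model P_hat(s' | s, a) from the samples.
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            P_hat[s, a] = np.bincount(next_state_samples[s, a], minlength=S) / N

    # Step 2: k iterations of the (assumed) recursive entropic Bellman backup
    #   Q(s,a) = r(s,a) + (gamma / beta) * log E_hat[exp(beta * max_a' Q(s',a'))].
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        V = Q.max(axis=1)               # greedy value estimate
        u = beta * V
        c = u.max()                     # shift for numerical stability
        # log E_hat[exp(u)] = c + log(P_hat @ exp(u - c)), computed per (s, a).
        log_mgf = c + np.log(P_hat @ np.exp(u - c))
        Q = rewards + (gamma / beta) * log_mgf

    greedy_policy = Q.argmax(axis=1)    # the pi_k of the abstract
    return Q, greedy_policy
```

As a usage sketch, on a random 5-state, 3-action MDP one could call `mb_rs_qvi(rng.integers(0, 5, size=(5, 3, 1000)), rng.random((5, 3)), beta=-1.0, gamma=0.9, num_iters=100)` with `rng = np.random.default_rng(0)`. Note that as $\beta \to 0$ the entropic backup converges to the standard risk-neutral Bellman update; the division by $\beta$ is undefined at zero, consistent with the abstract's requirement that $\beta \neq 0$.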
