Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism

2024-10-14

Kihyun Yu, Duksang Lee, William Overman, Dabeen Lee

Abstract

This paper studies the safe reinforcement learning problem formulated as an episodic finite-horizon tabular constrained Markov decision process with an unknown transition kernel and stochastic reward and cost functions. We propose a model-based algorithm based on novel cost and reward function estimators that provide tighter cost pessimism and reward optimism. While guaranteeing no constraint violation in every episode, our algorithm achieves a regret upper bound of Õ((C − C_b)^{-1} H^{2.5} S√(AK)), where C is the cost budget for an episode, C_b is the expected cost under a safe baseline policy over an episode, H is the horizon, and S, A, and K are the numbers of states, actions, and episodes, respectively. This improves upon the best-known regret upper bound, and when C − C_b = Ω(H), it nearly matches the regret lower bound of Ω(H^{1.5}√(SAK)). We deduce our cost and reward function estimators via a Bellman-type law of total variance to obtain tight bounds on the expected sum of the variances of value function estimates. This leads to a tighter dependence on the horizon in the function estimators. We also present numerical results to demonstrate the computational effectiveness of our proposed framework.
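The cost-pessimism / reward-optimism idea in the abstract can be illustrated with a minimal sketch. This is a hypothetical, assumption-laden example, not the paper's estimator: it builds a Bernstein-style confidence width from an empirical variance, overestimates cost (pessimism) and overestimates reward (optimism). The function names, constants, and the exact width formula are illustrative choices, not taken from the paper.

```python
import math

def mean_and_width(samples, H, delta=0.05):
    """Empirical mean plus a Bernstein-style confidence width.

    The variance-dependent first term is the one a law-of-total-variance
    argument can control on average; the H/n term is lower order.
    (Illustrative only: constants and form are NOT the paper's.)
    """
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    log_term = math.log(2.0 / delta)
    width = math.sqrt(2.0 * var * log_term / n) + 3.0 * H * log_term / n
    return mean, width

def optimistic_reward(samples, H):
    # Upper confidence bound on the reward, clipped to its range.
    m, w = mean_and_width(samples, H)
    return min(H, m + w)

def pessimistic_cost(samples, H):
    # Overestimate the cost so a policy judged safe truly satisfies
    # the budget constraint with high probability.
    m, w = mean_and_width(samples, H)
    return min(H, m + w)
```

With pessimistic cost estimates of this kind, an algorithm can fall back toward the safe baseline policy whenever the pessimistic estimate threatens the budget C, which is intuitively where a (C − C_b)^{-1} factor in the regret arises: the safety margin C − C_b controls how aggressively the learner can explore.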
