SOTAVerified

Enhancing RL Safety with Counterfactual LLM Reasoning

2024-09-16 · Code Available

Dennis Gross, Helge Spieker

Abstract

Reinforcement learning (RL) policies may exhibit unsafe behavior and are hard to explain. We use counterfactual large language model reasoning to enhance RL policy safety post-training. We show that our approach improves RL policy safety and helps to explain it.
