SOTAVerified

Enhancing RL Safety with Counterfactual LLM Reasoning

2024-09-16 · Code Available

Dennis Gross, Helge Spieker

Abstract

Reinforcement learning (RL) policies may exhibit unsafe behavior and are hard to explain. We use counterfactual large language model reasoning to enhance RL policy safety post-training. We show that our approach improves RL policy safety and helps to explain it.
