Enhancing RL Safety with Counterfactual LLM Reasoning
2024-09-16
Dennis Gross, Helge Spieker
- github.com/lava-lab/cool-mc (official, PyTorch)
Abstract
Reinforcement learning (RL) policies may exhibit unsafe behavior and are hard to explain. We use counterfactual large language model (LLM) reasoning to enhance RL policy safety post-training. We show that our approach improves RL policy safety and helps to explain it.
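To make the idea concrete, here is a minimal sketch of how counterfactual LLM reasoning could be applied to a trained RL policy at deployment time. This is not the paper's implementation: the prompt wording, the `query_llm` stub, and the `SafetyShieldedPolicy` wrapper are all hypothetical placeholders, assuming a discrete action space and text-describable states.

```python
"""Illustrative sketch only: all names below (query_llm,
SafetyShieldedPolicy, the prompt format) are assumptions,
not the paper's actual method."""

from typing import Callable, List, Tuple


def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in your provider's chat API."""
    raise NotImplementedError


class SafetyShieldedPolicy:
    """Wraps a trained RL policy and, post-training, asks an LLM a
    counterfactual question before executing each action. Actions the
    LLM judges unsafe are overridden, and the LLM's answer doubles as
    a human-readable explanation of the safety decision."""

    def __init__(self, policy: Callable[[str], str], actions: List[str]):
        self.policy = policy    # trained policy: state description -> action
        self.actions = actions  # discrete action set

    def act(self, state_desc: str) -> Tuple[str, str]:
        action = self.policy(state_desc)
        alternatives = [a for a in self.actions if a != action]
        # Counterfactual query: would a different action have been safer?
        prompt = (
            f"State: {state_desc}\n"
            f"Chosen action: {action}\n"
            f"Counterfactual: for each alternative in {alternatives}, "
            "would the outcome have been safer?\n"
            "Reply 'UNSAFE: <safer action> - <reason>' if the chosen "
            "action risks a safety violation, otherwise 'SAFE - <reason>'."
        )
        verdict = query_llm(prompt)
        if verdict.startswith("UNSAFE:"):
            # Override with the LLM-suggested safer alternative.
            safer = verdict.split(":", 1)[1].split("-", 1)[0].strip()
            if safer in self.actions:
                return safer, verdict
        return action, verdict
```

Under these assumptions, the wrapper returns both the (possibly overridden) action and the LLM's reasoning, which is how a single mechanism can improve safety and supply explanations at the same time.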