| Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs | Jul 22, 2024 | Model Editing, Red Teaming | Code Available | 1 |
| Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training | Mar 24, 2025 | Diversity, Large Language Model | Code Available | 1 |
| Explore, Establish, Exploit: Red Teaming Language Models from Scratch | Jun 15, 2023 | Red Teaming | Code Available | 1 |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and Safety, Diversity | Code Available | 1 |
| Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner | Jun 17, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | May 29, 2024 | Diversity, Language Modeling | Code Available | 1 |
| Jailbreaking as a Reward Misspecification Problem | Jun 20, 2024 | Red Teaming | Code Available | 1 |
| Gandalf the Red: Adaptive Security for LLMs | Jan 14, 2025 | Blocking, Language Modeling | Code Available | 1 |
| Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Mar 8, 2024 | image-classification, Image Classification | Code Available | 1 |
| Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts | Sep 12, 2023 | Red Teaming, Text-to-Image Generation | Code Available | 1 |