| WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models | Jun 26, 2024 | ChatbotRed Teaming | CodeCode Available | 2 |
| Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | Jul 17, 2024 | BenchmarkingRed Teaming | CodeCode Available | 2 |
| GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs | Nov 21, 2024 | Bayesian OptimizationRed Teaming | CodeCode Available | 1 |
| ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users | May 24, 2024 | DiversityLanguage Modeling | CodeCode Available | 1 |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and SafetyDiversity | CodeCode Available | 1 |
| DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | May 29, 2024 | DiversityLanguage Modeling | CodeCode Available | 1 |
| Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation | Feb 14, 2024 | Image GenerationRed Teaming | CodeCode Available | 1 |
| Explore, Establish, Exploit: Red Teaming Language Models from Scratch | Jun 15, 2023 | Red Teaming | CodeCode Available | 1 |
| Gandalf the Red: Adaptive Security for LLMs | Jan 14, 2025 | BlockingLanguage Modeling | CodeCode Available | 1 |
| CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference | Jun 25, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 1 |