| Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | Jul 17, 2024 | Benchmarking, Red Teaming | Code Available | 2 |
| WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models | Jun 26, 2024 | Chatbot, Red Teaming | Code Available | 2 |
| Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt | Jun 6, 2024 | Language Modelling, Large Language Model | Code Available | 2 |
| Improved Techniques for Optimization-Based Jailbreaking on Large Language Models | May 31, 2024 | Red Teaming | Code Available | 2 |
| AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs | Apr 21, 2024 | MMLU, Red Teaming | Code Available | 2 |
| ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming | Apr 6, 2024 | Adversarial Robustness, Dialogue Safety Prediction | Code Available | 2 |
| Against The Achilles' Heel: A Survey on Red Teaming for Generative Models | Mar 31, 2024 | Red Teaming, Survey | Code Available | 2 |
| Curiosity-driven Red-teaming for Large Language Models | Feb 29, 2024 | Red Teaming, Reinforcement Learning (RL) | Code Available | 2 |
| Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast | Feb 13, 2024 | Language Modelling, Large Language Model | Code Available | 2 |
| Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Oct 5, 2023 | Red Teaming, Safety Alignment | Code Available | 2 |