| Siren: A Learning-Based Multi-Turn Attack Framework for Simulating Real-World Human Jailbreak Behaviors | Jan 24, 2025 | Red Teaming | CodeCode Available | 1 |
| Gandalf the Red: Adaptive Security for LLMs | Jan 14, 2025 | BlockingLanguage Modeling | CodeCode Available | 1 |
| PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage | Dec 7, 2024 | Red TeamingSafety Alignment | CodeCode Available | 1 |
| GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs | Nov 21, 2024 | Bayesian OptimizationRed Teaming | CodeCode Available | 1 |
| Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents | Oct 11, 2024 | ChatbotRed Teaming | CodeCode Available | 1 |
| RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking | Sep 26, 2024 | Red Teaming | CodeCode Available | 1 |
| Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction | Sep 25, 2024 | DiversityRed Teaming | CodeCode Available | 1 |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and SafetyDiversity | CodeCode Available | 1 |
| SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models | Aug 5, 2024 | Red Teaming | CodeCode Available | 1 |
| Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs | Jul 22, 2024 | Model EditingRed Teaming | CodeCode Available | 1 |