| Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts | Sep 12, 2023 | Red Teaming, Text-to-Image Generation | Code Available | 1 | 5 |
| Explore, Establish, Exploit: Red Teaming Language Models from Scratch | Jun 15, 2023 | Red Teaming | Code Available | 1 | 5 |
| Gandalf the Red: Adaptive Security for LLMs | Jan 14, 2025 | Blocking, Language Modeling | Code Available | 1 | 5 |
| Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner | Jun 17, 2024 | Language Modeling | Code Available | 1 | 5 |
| DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | May 29, 2024 | Diversity, Language Modeling | Code Available | 1 | 5 |
| RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking | Sep 26, 2024 | Red Teaming | Code Available | 1 | 5 |
| Red Teaming Language Model Detectors with Language Models | May 31, 2023 | Adversarial Robustness, Language Modeling | Code Available | 1 | 5 |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and Safety, Diversity | Code Available | 1 | 5 |
| Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Mar 8, 2024 | Image Classification | Code Available | 1 | 5 |
| SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models | Aug 5, 2024 | Red Teaming | Code Available | 1 | 5 |