| Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs | Jul 22, 2024 | Model Editing, Red Teaming | Code Available | 1 | 5 |
| Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) | Jul 20, 2024 | Red Teaming | Code Available | 1 | 5 |
| Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Oct 10, 2023 | Red Teaming | Code Available | 1 | 5 |
| Causality Analysis for Evaluating the Security of Large Language Models | Dec 13, 2023 | Red Teaming | Code Available | 1 | 5 |
| Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction | Sep 25, 2024 | Diversity, Red Teaming | Code Available | 1 | 5 |
| Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner | Jun 17, 2024 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment | Aug 18, 2023 | MMLU, Red Teaming | Code Available | 1 | 5 |
| Control Risk for Potential Misuse of Artificial Intelligence in Science | Dec 11, 2023 | Red Teaming | Code Available | 1 | 5 |
| Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents | Oct 11, 2024 | Chatbot, Red Teaming | Code Available | 1 | 5 |
| Gandalf the Red: Adaptive Security for LLMs | Jan 14, 2025 | Blocking, Language Modeling | Code Available | 1 | 5 |
| RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking | Sep 26, 2024 | Red Teaming | Code Available | 1 | 5 |
| AI Control: Improving Safety Despite Intentional Subversion | Dec 12, 2023 | Red Teaming | Code Available | 1 | 5 |
| Explore, Establish, Exploit: Red Teaming Language Models from Scratch | Jun 15, 2023 | Red Teaming | Code Available | 1 | 5 |
| GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs | Nov 21, 2024 | Bayesian Optimization, Red Teaming | Code Available | 1 | 5 |
| DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | May 29, 2024 | Diversity, Language Modeling | Code Available | 1 | 5 |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and Safety, Diversity | Code Available | 1 | 5 |
| Red Teaming Language Model Detectors with Language Models | May 31, 2023 | Adversarial Robustness, Language Modeling | Code Available | 1 | 5 |
| Red Teaming Language Models with Language Models | Feb 7, 2022 | Chatbot, Diversity | Code Available | 1 | 5 |
| Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Mar 8, 2024 | image-classification, Image Classification | Code Available | 1 | 5 |
| Attack Prompt Generation for Red Teaming and Defending Large Language Models | Oct 19, 2023 | In-Context Learning, Red Teaming | Code Available | 1 | 5 |
| Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training | Mar 24, 2025 | Diversity, Large Language Model | Code Available | 1 | 5 |
| Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Jun 17, 2024 | 16k, Language Modelling | Code Available | 0 | 5 |
| ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models | Oct 14, 2023 | Red Teaming | Code Available | 0 | 5 |
| RedDebate: Safer Responses through Multi-Agent Red Teaming Debates | Jun 4, 2025 | Red Teaming | Code Available | 0 | 5 |
| RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages | Jul 8, 2025 | Red Teaming | Code Available | 0 | 5 |