| Automating Privilege Escalation with Deep Reinforcement Learning | Oct 4, 2021 | BIG-bench Machine Learning, Deep Reinforcement Learning | —Unverified | 0 |
| AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration | Mar 20, 2025 | Red Teaming | —Unverified | 0 |
| Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models | Jan 3, 2025 | Red Teaming | —Unverified | 0 |
| Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming | Feb 22, 2025 | Diversity, In-Context Learning | —Unverified | 0 |
| Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems | Jul 22, 2024 | Contrastive Learning, Gender Prediction | —Unverified | 0 |
| Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models | Mar 3, 2025 | Red Teaming, Survey | —Unverified | 0 |
| Can Language Models be Instructed to Protect Personal Information? | Oct 3, 2023 | Adversarial Robustness, Red Teaming | —Unverified | 0 |
| Can Large Language Models Automatically Jailbreak GPT-4V? | Jul 23, 2024 | Face Recognition, In-Context Learning | —Unverified | 0 |
| Can Large Language Models Change User Preference Adversarially? | Jan 5, 2023 | Red Teaming | —Unverified | 0 |
| CELL your Model: Contrastive Explanations for Large Language Models | Jun 17, 2024 | Red Teaming, Text Generation | —Unverified | 0 |
| Computational Red Teaming in a Sudoku Solving Context: Neural Network Based Skill Representation and Acquisition | Feb 27, 2018 | Red Teaming | —Unverified | 0 |
| Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming | Jan 31, 2025 | Red Teaming | —Unverified | 0 |
| Conversational Complexity for Assessing Risk in Large Language Models | Sep 2, 2024 | Red Teaming | —Unverified | 0 |
| CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring | May 29, 2025 | Red Teaming | —Unverified | 0 |
| CTI4AI: Threat Intelligence Generation and Sharing after Red Teaming AI Models | Aug 16, 2022 | Red Teaming | —Unverified | 0 |
| CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge | Apr 10, 2024 | Red Teaming | —Unverified | 0 |
| CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models | May 19, 2025 | Benchmarking, Red Teaming | —Unverified | 0 |
| DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions | Dec 7, 2023 | Code Generation, Red Teaming | —Unverified | 0 |
| Desert Camels and Oil Sheikhs: Arab-Centric Red Teaming of Frontier LLMs | Oct 31, 2024 | Red Teaming | —Unverified | 0 |
| Atoxia: Red-teaming Large Language Models with Target Toxic Answers | Aug 27, 2024 | Prompt Engineering, Red Teaming | —Unverified | 0 |
| DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization | Aug 18, 2024 | Red Teaming | —Unverified | 0 |
| Digital cloning of online social networks for language-sensitive agent-based modeling of misinformation spread | Jan 23, 2024 | Misinformation, Red Teaming | —Unverified | 0 |
| Direct Unlearning Optimization for Robust and Safe Text-to-Image Models | Jul 17, 2024 | Red Teaming | —Unverified | 0 |
| Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning | Dec 24, 2024 | Diversity, Large Language Model | —Unverified | 0 |
| DMRL: Data- and Model-aware Reward Learning for Data Extraction | May 7, 2025 | Prompt Engineering, Red Teaming | —Unverified | 0 |