| Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations | Oct 9, 2024 | Language ModelingLanguage Modelling | —Unverified | 0 |
| RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent | Jul 23, 2024 | Red Teaming | —Unverified | 0 |
| X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents | Apr 15, 2025 | DiversityRed Teaming | —Unverified | 0 |
| LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs | May 16, 2025 | Red Teaming | —Unverified | 0 |
| GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models | Jun 11, 2025 | Large Language ModelRed Teaming | —Unverified | 0 |
| AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications | Nov 14, 2023 | DiversityRed Teaming | —Unverified | 0 |
| Adversaries Can Misuse Combinations of Safe Models | Jun 20, 2024 | Red Teaming | —Unverified | 0 |
| AdvAgent: Controllable Blackbox Red-teaming on Web Agents | Oct 22, 2024 | Decision MakingRed Teaming | —Unverified | 0 |
| A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation | Oct 15, 2024 | Anomaly DetectionRed Teaming | —Unverified | 0 |
| A Framework for Evaluating Emerging Cyberattack Capabilities of AI | Mar 14, 2025 | Red Teaming | —Unverified | 0 |
| A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management | Feb 10, 2025 | ManagementRed Teaming | —Unverified | 0 |
| AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents | May 9, 2025 | NavigateRed Teaming | —Unverified | 0 |
| AI red-teaming is a sociotechnical challenge: on values, labor, and harms | Dec 12, 2024 | Red Teaming | —Unverified | 0 |
| A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI | Apr 23, 2024 | Prompt EngineeringRed Teaming | —Unverified | 0 |
| A Multi-Disciplinary Review of Knowledge Acquisition Methods: From Human to Autonomous Eliciting Agents | Feb 27, 2018 | General ClassificationRed Teaming | —Unverified | 0 |
| A Red Teaming Framework for Securing AI in Maritime Autonomous Systems | Dec 8, 2023 | Red Teaming | —Unverified | 0 |
| A Red Teaming Roadmap Towards System-Level Safety | May 30, 2025 | Large Language ModelRed Teaming | —Unverified | 0 |
| A Reward-driven Automated Webshell Malicious-code Generator for Red-teaming | May 30, 2025 | Code GenerationDiversity | —Unverified | 0 |
| Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts | Jul 21, 2024 | EthicsRed Teaming | —Unverified | 0 |
| A Safe Harbor for AI Evaluation and Red Teaming | Mar 7, 2024 | Red Teaming | —Unverified | 0 |
| Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI | Sep 23, 2024 | Red Teaming | —Unverified | 0 |
| AttackGNN: Red-Teaming GNNs in Hardware Security Using Reinforcement Learning | Feb 21, 2024 | Graph Neural NetworkRed Teaming | —Unverified | 0 |
| Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code | Mar 30, 2024 | Continual PretrainingLanguage Modelling | —Unverified | 0 |
| Automated Red Teaming with GOAT: the Generative Offensive Agent Tester | Oct 2, 2024 | Red Teaming | —Unverified | 0 |
| Automating Privilege Escalation with Deep Reinforcement Learning | Oct 4, 2021 | BIG-bench Machine LearningDeep Reinforcement Learning | —Unverified | 0 |
| AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration | Mar 20, 2025 | Red Teaming | —Unverified | 0 |
| Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models | Jan 3, 2025 | Red Teaming | —Unverified | 0 |
| Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming | Feb 22, 2025 | DiversityIn-Context Learning | —Unverified | 0 |
| Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems | Jul 22, 2024 | Contrastive LearningGender Prediction | —Unverified | 0 |
| Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models | Mar 3, 2025 | Red TeamingSurvey | —Unverified | 0 |
| Can Language Models be Instructed to Protect Personal Information? | Oct 3, 2023 | Adversarial RobustnessRed Teaming | —Unverified | 0 |
| Can Large Language Models Automatically Jailbreak GPT-4V? | Jul 23, 2024 | Face RecognitionIn-Context Learning | —Unverified | 0 |
| Can Large Language Models Change User Preference Adversarially? | Jan 5, 2023 | Red Teaming | —Unverified | 0 |
| CELL your Model: Contrastive Explanations for Large Language Models | Jun 17, 2024 | Red TeamingText Generation | —Unverified | 0 |
| Computational Red Teaming in a Sudoku Solving Context: Neural Network Based Skill Representation and Acquisition | Feb 27, 2018 | Red Teaming | —Unverified | 0 |
| Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming | Jan 31, 2025 | Red Teaming | —Unverified | 0 |
| Conversational Complexity for Assessing Risk in Large Language Models | Sep 2, 2024 | Red Teaming | —Unverified | 0 |
| CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring | May 29, 2025 | Red Teaming | —Unverified | 0 |
| CTI4AI: Threat Intelligence Generation and Sharing after Red Teaming AI Models | Aug 16, 2022 | Red Teaming | —Unverified | 0 |
| CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge | Apr 10, 2024 | Red Teaming | —Unverified | 0 |
| CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models | May 19, 2025 | BenchmarkingRed Teaming | —Unverified | 0 |
| DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions | Dec 7, 2023 | Code GenerationRed Teaming | —Unverified | 0 |
| Desert Camels and Oil Sheikhs: Arab-Centric Red Teaming of Frontier LLMs | Oct 31, 2024 | Red Teaming | —Unverified | 0 |
| Atoxia: Red-teaming Large Language Models with Target Toxic Answers | Aug 27, 2024 | Prompt EngineeringRed Teaming | —Unverified | 0 |
| DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization | Aug 18, 2024 | Red Teaming | —Unverified | 0 |
| Digital cloning of online social networks for language-sensitive agent-based modeling of misinformation spread | Jan 23, 2024 | MisinformationRed Teaming | —Unverified | 0 |
| Direct Unlearning Optimization for Robust and Safe Text-to-Image Models | Jul 17, 2024 | Red Teaming | —Unverified | 0 |
| Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning | Dec 24, 2024 | DiversityLarge Language Model | —Unverified | 0 |
| DMRL: Data- and Model-aware Reward Learning for Data Extraction | May 7, 2025 | Prompt EngineeringRed Teaming | —Unverified | 0 |
| Effective Red-Teaming of Policy-Adherent Agents | Jun 11, 2025 | Red Teaming | —Unverified | 0 |