| Investigating Bias Representations in Llama 2 Chat via Activation Steering | Feb 1, 2024 | Decision Making, Red Teaming | —Unverified | 0 |
| Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming | Jan 31, 2025 | Red Teaming | —Unverified | 0 |
| Computational Red Teaming in a Sudoku Solving Context: Neural Network Based Skill Representation and Acquisition | Feb 27, 2018 | Red Teaming | —Unverified | 0 |
| CELL your Model: Contrastive Explanations for Large Language Models | Jun 17, 2024 | Red Teaming, Text Generation | —Unverified | 0 |
| Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts | Jul 21, 2024 | Ethics, Red Teaming | —Unverified | 0 |
| IterAlign: Iterative Constitutional Alignment of Large Language Models | Mar 27, 2024 | Red Teaming | —Unverified | 0 |
| A Reward-driven Automated Webshell Malicious-code Generator for Red-teaming | May 30, 2025 | Code Generation, Diversity | —Unverified | 0 |
| Can Large Language Models Change User Preference Adversarially? | Jan 5, 2023 | Red Teaming | —Unverified | 0 |
| A Red Teaming Roadmap Towards System-Level Safety | May 30, 2025 | Large Language Model, Red Teaming | —Unverified | 0 |
| GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models | Jun 11, 2025 | Large Language Model, Red Teaming | —Unverified | 0 |
| Can Large Language Models Automatically Jailbreak GPT-4V? | Jul 23, 2024 | Face Recognition, In-Context Learning | —Unverified | 0 |
| Can Language Models be Instructed to Protect Personal Information? | Oct 3, 2023 | Adversarial Robustness, Red Teaming | —Unverified | 0 |
| A Red Teaming Framework for Securing AI in Maritime Autonomous Systems | Dec 8, 2023 | Red Teaming | —Unverified | 0 |
| Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models | Mar 3, 2025 | Red Teaming, Survey | —Unverified | 0 |
| Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems | Jul 22, 2024 | Contrastive Learning, Gender Prediction | —Unverified | 0 |
| A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management | Feb 10, 2025 | Management, Red Teaming | —Unverified | 0 |
| Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis | Oct 21, 2024 | Red Teaming | —Unverified | 0 |
| JAB: Joint Adversarial Prompting and Belief Augmentation | Nov 16, 2023 | Red Teaming | —Unverified | 0 |
| Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols | Sep 12, 2024 | Decision Making, Red Teaming | —Unverified | 0 |
| LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs | May 16, 2025 | Red Teaming | —Unverified | 0 |
| FLIRT: Feedback Loop In-context Red Teaming | Aug 8, 2023 | In-Context Learning, Red Teaming | —Unverified | 0 |
| GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization | May 25, 2025 | Large Language Model, Red Teaming | —Unverified | 0 |
| A Multi-Disciplinary Review of Knowledge Acquisition Methods: From Human to Autonomous Eliciting Agents | Feb 27, 2018 | General Classification, Red Teaming | —Unverified | 0 |
| Finding Safety Neurons in Large Language Models | Jun 20, 2024 | Misinformation, Red Teaming | —Unverified | 0 |
| A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI | Apr 23, 2024 | Prompt Engineering, Red Teaming | —Unverified | 0 |