| Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations | Oct 9, 2024 | Language Modeling | —Unverified | 0 |
| AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs | Oct 3, 2024 | Red Teaming | Code Available | 3 |
| SteerDiff: Steering towards Safe Text-to-Image Diffusion Models | Oct 3, 2024 | Image Generation, Red Teaming | —Unverified | 0 |
| Automated Red Teaming with GOAT: the Generative Offensive Agent Tester | Oct 2, 2024 | Red Teaming | —Unverified | 0 |
| PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI System | Oct 1, 2024 | Red Teaming | Code Available | 7 |
| Overriding Safety protections of Open-source Models | Sep 28, 2024 | Red Teaming, Safety Alignment | Code Available | 0 |
| RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking | Sep 26, 2024 | Red Teaming | Code Available | 1 |
| Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction | Sep 25, 2024 | Diversity, Red Teaming | Code Available | 1 |
| Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI | Sep 23, 2024 | Red Teaming | —Unverified | 0 |
| Jailbreaking Large Language Models with Symbolic Mathematics | Sep 17, 2024 | Red Teaming | —Unverified | 0 |
| What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing | Sep 14, 2024 | Red Teaming | Code Available | 0 |
| Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols | Sep 12, 2024 | Decision Making, Red Teaming | —Unverified | 0 |
| Exploring Straightforward Conversational Red-Teaming | Sep 7, 2024 | Red Teaming | —Unverified | 0 |
| Conversational Complexity for Assessing Risk in Large Language Models | Sep 2, 2024 | Red Teaming | —Unverified | 0 |
| Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness | Aug 31, 2024 | Fairness, Language Modeling | —Unverified | 0 |
| LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | Aug 27, 2024 | Language Modeling | Code Available | 2 |
| Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models | Aug 27, 2024 | Red Teaming, Transfer Learning | Code Available | 0 |
| Atoxia: Red-teaming Large Language Models with Target Toxic Answers | Aug 27, 2024 | Prompt Engineering, Red Teaming | —Unverified | 0 |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and Safety, Diversity | Code Available | 1 |
| DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization | Aug 18, 2024 | Red Teaming | —Unverified | 0 |
| SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | Aug 14, 2024 | Red Teaming, Safety Alignment | —Unverified | 0 |
| Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search | Aug 11, 2024 | Red Teaming | Code Available | 0 |
| h4rm3l: A language for Composable Jailbreak Attack Synthesis | Aug 9, 2024 | Benchmarking, Program Synthesis | —Unverified | 0 |
| SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models | Aug 5, 2024 | Red Teaming | Code Available | 1 |
| Tamper-Resistant Safeguards for Open-Weight LLMs | Aug 1, 2024 | Red Teaming, TAR | Code Available | 2 |
| Can Large Language Models Automatically Jailbreak GPT-4V? | Jul 23, 2024 | Face Recognition, In-Context Learning | —Unverified | 0 |
| RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent | Jul 23, 2024 | Red Teaming | —Unverified | 0 |
| Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs | Jul 22, 2024 | Model Editing, Red Teaming | Code Available | 1 |
| Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems | Jul 22, 2024 | Contrastive Learning, Gender Prediction | —Unverified | 0 |
| Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts | Jul 21, 2024 | Ethics, Red Teaming | —Unverified | 0 |
| Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) | Jul 20, 2024 | Red Teaming | Code Available | 1 |
| Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle | Jul 18, 2024 | Benchmarking, Language Modeling | —Unverified | 0 |
| Direct Unlearning Optimization for Robust and Safe Text-to-Image Models | Jul 17, 2024 | Red Teaming | —Unverified | 0 |
| AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases | Jul 17, 2024 | Autonomous Driving, Backdoor Attack | Code Available | 3 |
| Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | Jul 17, 2024 | Benchmarking, Red Teaming | Code Available | 2 |
| ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Low-Perplexity Toxic Prompts | Jul 12, 2024 | Language Modeling | Code Available | 0 |
| The Human Factor in AI Red Teaming: Perspectives from Social and Collaborative Computing | Jul 10, 2024 | Fairness, Red Teaming | —Unverified | 0 |
| Automated Progressive Red Teaming | Jul 4, 2024 | Active Learning, Red Teaming | Code Available | 0 |
| SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Jul 2, 2024 | Red Teaming, Safety Alignment | Code Available | 0 |
| Purple-teaming LLMs with Adversarial Defender Training | Jul 1, 2024 | Generative Adversarial Network, Red Teaming | —Unverified | 0 |
| WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models | Jun 26, 2024 | Chatbot, Red Teaming | Code Available | 2 |
| The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm | Jun 26, 2024 | Cross-Lingual Transfer, Red Teaming | —Unverified | 0 |
| CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference | Jun 25, 2024 | Language Modeling | Code Available | 1 |
| Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations | Jun 25, 2024 | Red Teaming, Reinforcement Learning (RL) | —Unverified | 0 |
| Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Jun 21, 2024 | Red Teaming, TruthfulQA | Code Available | 0 |
| Adversaries Can Misuse Combinations of Safe Models | Jun 20, 2024 | Red Teaming | —Unverified | 0 |
| Jailbreaking as a Reward Misspecification Problem | Jun 20, 2024 | Red Teaming | Code Available | 1 |
| Finding Safety Neurons in Large Language Models | Jun 20, 2024 | Misinformation, Red Teaming | —Unverified | 0 |
| Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Jun 17, 2024 | 16k, Language Modelling | Code Available | 0 |
| Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner | Jun 17, 2024 | Language Modeling | Code Available | 1 |