| DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization | Aug 18, 2024 | Red Teaming | Unverified | 0 |
| SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | Aug 14, 2024 | Red Teaming, Safety Alignment | Unverified | 0 |
| Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search | Aug 11, 2024 | Red Teaming | Code Available | 0 |
| h4rm3l: A language for Composable Jailbreak Attack Synthesis | Aug 9, 2024 | Benchmarking, Program Synthesis | Unverified | 0 |
| RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent | Jul 23, 2024 | Red Teaming | Unverified | 0 |
| Can Large Language Models Automatically Jailbreak GPT-4V? | Jul 23, 2024 | Face Recognition, In-Context Learning | Unverified | 0 |
| Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems | Jul 22, 2024 | Contrastive Learning, Gender Prediction | Unverified | 0 |
| Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts | Jul 21, 2024 | Ethics, Red Teaming | Unverified | 0 |
| Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle | Jul 18, 2024 | Benchmarking, Language Modeling | Unverified | 0 |
| Direct Unlearning Optimization for Robust and Safe Text-to-Image Models | Jul 17, 2024 | Red Teaming | Unverified | 0 |