| Title | Date | Tags | Code | # |
| --- | --- | --- | --- | --- |
| SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | Aug 14, 2024 | Red Teaming, Safety Alignment | Unverified | 0 |
| Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search | Aug 11, 2024 | Red Teaming | Code Available | 0 |
| h4rm3l: A language for Composable Jailbreak Attack Synthesis | Aug 9, 2024 | Benchmarking, Program Synthesis | Unverified | 0 |
| SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models | Aug 5, 2024 | Red Teaming | Code Available | 1 |
| Tamper-Resistant Safeguards for Open-Weight LLMs | Aug 1, 2024 | Red Teaming, TAR | Code Available | 2 |
| Can Large Language Models Automatically Jailbreak GPT-4V? | Jul 23, 2024 | Face Recognition, In-Context Learning | Unverified | 0 |
| RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent | Jul 23, 2024 | Red Teaming | Unverified | 0 |
| Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs | Jul 22, 2024 | Model Editing, Red Teaming | Code Available | 1 |
| Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems | Jul 22, 2024 | Contrastive Learning, Gender Prediction | Unverified | 0 |
| Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts | Jul 21, 2024 | Ethics, Red Teaming | Unverified | 0 |