| What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing | Sep 14, 2024 | Red Teaming | Code Available | 0 |
| Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols | Sep 12, 2024 | Decision Making, Red Teaming | —Unverified | 0 |
| Exploring Straightforward Conversational Red-Teaming | Sep 7, 2024 | Red Teaming | —Unverified | 0 |
| Conversational Complexity for Assessing Risk in Large Language Models | Sep 2, 2024 | Red Teaming | —Unverified | 0 |
| Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness | Aug 31, 2024 | Fairness, Language Modeling | —Unverified | 0 |
| LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | Aug 27, 2024 | Language Modeling, Language Modelling | Code Available | 2 |
| Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models | Aug 27, 2024 | Red Teaming, Transfer Learning | Code Available | 0 |
| Atoxia: Red-teaming Large Language Models with Target Toxic Answers | Aug 27, 2024 | Prompt Engineering, Red Teaming | —Unverified | 0 |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and Safety, Diversity | Code Available | 1 |
| DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization | Aug 18, 2024 | Red Teaming | —Unverified | 0 |