| Predictive Red Teaming: Breaking Policies Without Breaking Robots | Feb 10, 2025 | Imitation Learning, Red Teaming | —Unverified | 0 |
| KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs | Feb 5, 2025 | Diversity, Prompt Engineering | —Unverified | 0 |
| Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming | Jan 31, 2025 | Red Teaming | —Unverified | 0 |
| RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts | Jan 29, 2025 | Chatbot, Red Teaming | Code Available | 0 |
| Playing Devil's Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models | Jan 14, 2025 | Red Teaming | —Unverified | 0 |
| Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints | Jan 14, 2025 | Large Language Model, Red Teaming | —Unverified | 0 |
| Lessons From Red Teaming 100 Generative AI Products | Jan 13, 2025 | Benchmarking, Red Teaming | —Unverified | 0 |
| Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency | Jan 9, 2025 | Red Teaming | —Unverified | 0 |
| Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models | Jan 3, 2025 | Red Teaming | —Unverified | 0 |
| Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning | Dec 24, 2024 | Diversity, Large Language Model | —Unverified | 0 |