| Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? | Oct 16, 2023 | Red Teaming | Code Available | 1 |
| A Safe Harbor for AI Evaluation and Red Teaming | Mar 7, 2024 | Red Teaming | —Unverified | 0 |
| CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring | May 29, 2025 | Red Teaming | —Unverified | 0 |
| Adversaries Can Misuse Combinations of Safe Models | Jun 20, 2024 | Red Teaming | —Unverified | 0 |
| Conversational Complexity for Assessing Risk in Large Language Models | Sep 2, 2024 | Red Teaming | —Unverified | 0 |
| Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming | Jan 31, 2025 | Red Teaming | —Unverified | 0 |
| Computational Red Teaming in a Sudoku Solving Context: Neural Network Based Skill Representation and Acquisition | Feb 27, 2018 | Red Teaming | —Unverified | 0 |
| CELL your Model: Contrastive Explanations for Large Language Models | Jun 17, 2024 | Red Teaming, Text Generation | —Unverified | 0 |
| Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts | Jul 21, 2024 | Ethics, Red Teaming | —Unverified | 0 |
| Finding Safety Neurons in Large Language Models | Jun 20, 2024 | Misinformation, Red Teaming | —Unverified | 0 |