| Embodied Red Teaming for Auditing Robotic Foundation Models | Nov 27, 2024 | Red Teaming | —Unverified | 0 | 0 |
| Finding Safety Neurons in Large Language Models | Jun 20, 2024 | Misinformation, Red Teaming | —Unverified | 0 | 0 |
| ELAB: Extensive LLM Alignment Benchmark in Persian Language | Apr 17, 2025 | Fairness, Red Teaming | —Unverified | 0 | 0 |
| FLIRT: Feedback Loop In-context Red Teaming | Aug 8, 2023 | In-Context Learning, Red Teaming | —Unverified | 0 | 0 |
| Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols | Sep 12, 2024 | Decision Making, Red Teaming | —Unverified | 0 | 0 |
| Effective Red-Teaming of Policy-Adherent Agents | Jun 11, 2025 | Red Teaming | —Unverified | 0 | 0 |
| DMRL: Data- and Model-aware Reward Learning for Data Extraction | May 7, 2025 | Prompt Engineering, Red Teaming | —Unverified | 0 | 0 |
| Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning | Dec 24, 2024 | Diversity, Large Language Model | —Unverified | 0 | 0 |
| GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization | May 25, 2025 | Large Language Model, Red Teaming | —Unverified | 0 | 0 |
| Direct Unlearning Optimization for Robust and Safe Text-to-Image Models | Jul 17, 2024 | Red Teaming | —Unverified | 0 | 0 |