| EVA: Red-Teaming GUI Agents via Evolving Indirect Prompt Injection | May 20, 2025 | Red Teaming | —Unverified | 0 | 0 |
| Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity | Jan 30, 2023 | EthicsLanguage Modelling | —Unverified | 0 | 0 |
| Exploring Straightforward Conversational Red-Teaming | Sep 7, 2024 | Red Teaming | —Unverified | 0 | 0 |
| Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation | May 24, 2025 | Intent DetectionNatural Language Understanding | —Unverified | 0 | 0 |
| Fast Proxies for LLM Robustness Evaluation | Feb 14, 2025 | Red Teaming | —Unverified | 0 | 0 |
| Embodied Red Teaming for Auditing Robotic Foundation Models | Nov 27, 2024 | Red Teaming | —Unverified | 0 | 0 |
| Finding Safety Neurons in Large Language Models | Jun 20, 2024 | MisinformationRed Teaming | —Unverified | 0 | 0 |
| ELAB: Extensive LLM Alignment Benchmark in Persian Language | Apr 17, 2025 | FairnessRed Teaming | —Unverified | 0 | 0 |
| FLIRT: Feedback Loop In-context Red Teaming | Aug 8, 2023 | In-Context LearningRed Teaming | —Unverified | 0 | 0 |
| Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols | Sep 12, 2024 | Decision MakingRed Teaming | —Unverified | 0 | 0 |
| Effective Red-Teaming of Policy-Adherent Agents | Jun 11, 2025 | Red Teaming | —Unverified | 0 | 0 |
| DMRL: Data- and Model-aware Reward Learning for Data Extraction | May 7, 2025 | Prompt EngineeringRed Teaming | —Unverified | 0 | 0 |
| Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning | Dec 24, 2024 | DiversityLarge Language Model | —Unverified | 0 | 0 |
| GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization | May 25, 2025 | Large Language ModelRed Teaming | —Unverified | 0 | 0 |
| Direct Unlearning Optimization for Robust and Safe Text-to-Image Models | Jul 17, 2024 | Red Teaming | —Unverified | 0 | 0 |
| Digital cloning of online social networks for language-sensitive agent-based modeling of misinformation spread | Jan 23, 2024 | MisinformationRed Teaming | —Unverified | 0 | 0 |
| Gradient-Based Language Model Red Teaming | Jan 30, 2024 | Language ModelingLanguage Modelling | —Unverified | 0 | 0 |
| h4rm3l: A language for Composable Jailbreak Attack Synthesis | Aug 9, 2024 | BenchmarkingProgram Synthesis | —Unverified | 0 | 0 |
| "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | May 20, 2025 | Image GenerationRed Teaming | —Unverified | 0 | 0 |
| DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization | Aug 18, 2024 | Red Teaming | —Unverified | 0 | 0 |
| Hidden Ghost Hand: Unveiling Backdoor Vulnerabilities in MLLM-Powered Mobile GUI Agents | May 20, 2025 | Contrastive LearningRed Teaming | —Unverified | 0 | 0 |
| Atoxia: Red-teaming Large Language Models with Target Toxic Answers | Aug 27, 2024 | Prompt EngineeringRed Teaming | —Unverified | 0 | 0 |
| HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback | Mar 13, 2024 | Language ModellingLarge Language Model | —Unverified | 0 | 0 |
| Desert Camels and Oil Sheikhs: Arab-Centric Red Teaming of Frontier LLMs | Oct 31, 2024 | Red Teaming | —Unverified | 0 | 0 |
| In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models | Nov 25, 2024 | Red TeamingSemantic Similarity | —Unverified | 0 | 0 |