| Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols | Sep 12, 2024 | Decision Making, Red Teaming | Unverified | 0 |
| GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization | May 25, 2025 | Large Language Model, Red Teaming | Unverified | 0 |
| Gradient-Based Language Model Red Teaming | Jan 30, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| h4rm3l: A language for Composable Jailbreak Attack Synthesis | Aug 9, 2024 | Benchmarking, Program Synthesis | Unverified | 0 |
| "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | May 20, 2025 | Image Generation, Red Teaming | Unverified | 0 |
| Hidden Ghost Hand: Unveiling Backdoor Vulnerabilities in MLLM-Powered Mobile GUI Agents | May 20, 2025 | Contrastive Learning, Red Teaming | Unverified | 0 |
| HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback | Mar 13, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models | Nov 25, 2024 | Red Teaming, Semantic Similarity | Unverified | 0 |
| Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis | Oct 21, 2024 | Red Teaming | Unverified | 0 |
| Investigating Bias Representations in Llama 2 Chat via Activation Steering | Feb 1, 2024 | Decision Making, Red Teaming | Unverified | 0 |