SOTAVerified

Red Teaming

Papers

Showing 161-170 of 251 papers

| Title | Status | Hype |
| --- | --- | --- |
| Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols | | 0 |
| GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization | | 0 |
| Gradient-Based Language Model Red Teaming | | 0 |
| h4rm3l: A language for Composable Jailbreak Attack Synthesis | | 0 |
| "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | | 0 |
| Hidden Ghost Hand: Unveiling Backdoor Vulnerabilities in MLLM-Powered Mobile GUI Agents | | 0 |
| HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback | | 0 |
| In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models | | 0 |
| Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis | | 0 |
| Investigating Bias Representations in Llama 2 Chat via Activation Steering | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | SUDO | Attack Success Rate | 41 | | Unverified |