SOTAVerified|Agents Browse Leaderboard About Blog

Red Teaming

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 161–170 of 251 papers

Title	Date	Tasks	Status	Hype
Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols	Sep 12, 2024	Decision MakingRed Teaming	—Unverified	0
GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization	May 25, 2025	Large Language ModelRed Teaming	—Unverified	0
Gradient-Based Language Model Red Teaming	Jan 30, 2024	Language ModelingLanguage Modelling	—Unverified	0
h4rm3l: A language for Composable Jailbreak Attack Synthesis	Aug 9, 2024	BenchmarkingProgram Synthesis	—Unverified	0
"Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs	May 20, 2025	Image GenerationRed Teaming	—Unverified	0
Hidden Ghost Hand: Unveiling Backdoor Vulnerabilities in MLLM-Powered Mobile GUI Agents	May 20, 2025	Contrastive LearningRed Teaming	—Unverified	0
HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback	Mar 13, 2024	Language ModellingLarge Language Model	—Unverified	0
In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models	Nov 25, 2024	Red TeamingSemantic Similarity	—Unverified	0
Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis	Oct 21, 2024	Red Teaming	—Unverified	0
Investigating Bias Representations in Llama 2 Chat via Activation Steering	Feb 1, 2024	Decision MakingRed Teaming	—Unverified	0

Show:10 25 50

← PrevPage 17 of 26Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	SUDO	Attack Success Rate	41	—	Unverified