SOTAVerified

Red Teaming

Papers

Showing 101–125 of 251 papers

| Title | Status | Hype |
| --- | --- | --- |
| Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations | — | 0 |
| AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs | Code | 3 |
| SteerDiff: Steering towards Safe Text-to-Image Diffusion Models | — | 0 |
| Automated Red Teaming with GOAT: the Generative Offensive Agent Tester | — | 0 |
| PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI System | Code | 7 |
| Overriding Safety protections of Open-source Models | Code | 0 |
| RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking | Code | 1 |
| Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction | Code | 1 |
| Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI | — | 0 |
| Jailbreaking Large Language Models with Symbolic Mathematics | — | 0 |
| What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing | Code | 0 |
| Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols | — | 0 |
| Exploring Straightforward Conversational Red-Teaming | — | 0 |
| Conversational Complexity for Assessing Risk in Large Language Models | — | 0 |
| Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness | — | 0 |
| Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models | Code | 0 |
| LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | Code | 2 |
| Atoxia: Red-teaming Large Language Models with Target Toxic Answers | — | 0 |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1 |
| DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization | — | 0 |
| SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | — | 0 |
| Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search | Code | 0 |
| h4rm3l: A language for Composable Jailbreak Attack Synthesis | — | 0 |
| SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models | Code | 1 |
| Tamper-Resistant Safeguards for Open-Weight LLMs | Code | 2 |
Page 5 of 11

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | SUDO | Attack Success Rate | 41 | — | Unverified |