SOTAVerified

Red Teaming

Papers

Showing 21–30 of 251 papers

Title | Status | Hype
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | Code | 2
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models | Code | 2
CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference | Code | 1
ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users | Code | 1
Control Risk for Potential Misuse of Artificial Intelligence in Science | Code | 1
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Code | 1
Causality Analysis for Evaluating the Security of Large Language Models | Code | 1
Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation | Code | 1
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs | Code | 1
Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Code | 1
Page 3 of 26

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | — | Unverified