SOTAVerified

Red Teaming

Papers

Showing 51–60 of 251 papers

| Title | Status | Hype |
|---|---|---|
| "Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak | Code | 1 |
| Jailbreaking as a Reward Misspecification Problem | Code | 1 |
| OET: Optimization-based prompt injection Evaluation Toolkit | Code | 1 |
| Causality Analysis for Evaluating the Security of Large Language Models | Code | 1 |
| Attack Prompt Generation for Red Teaming and Defending Large Language Models | Code | 1 |
| AI Control: Improving Safety Despite Intentional Subversion | Code | 1 |
| ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users | Code | 1 |
| Control Risk for Potential Misuse of Artificial Intelligence in Science | Code | 1 |
| DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | Code | 1 |
| CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | SUDO | Attack Success Rate | 41 | — | Unverified |