SOTAVerified

Red Teaming

Papers

Showing 51-60 of 251 papers

Title | Status | Hype
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
Gandalf the Red: Adaptive Security for LLMs | Code | 1
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Code | 1
Causality Analysis for Evaluating the Security of Large Language Models | Code | 1
Attack Prompt Generation for Red Teaming and Defending Large Language Models | Code | 1
AI Control: Improving Safety Despite Intentional Subversion | Code | 1
ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users | Code | 1
Control Risk for Potential Misuse of Artificial Intelligence in Science | Code | 1
DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | Code | 1
Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | - | Unverified