SOTAVerified

Red Teaming

Papers

Showing 31–40 of 251 papers

| Title | Status | Hype |
|---|---|---|
| Large Language Model Unlearning | Code | 1 |
| CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference | Code | 1 |
| Aloe: A Family of Fine-tuned Open Healthcare LLMs | Code | 1 |
| Control Risk for Potential Misuse of Artificial Intelligence in Science | Code | 1 |
| Jailbreaking as a Reward Misspecification Problem | Code | 1 |
| Jailbroken: How Does LLM Safety Training Fail? | Code | 1 |
| Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Code | 1 |
| Attack Prompt Generation for Red Teaming and Defending Large Language Models | Code | 1 |
| AI Control: Improving Safety Despite Intentional Subversion | Code | 1 |
| Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | SUDO | Attack Success Rate | 41 | | Unverified |
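
The listing does not define the metric, but assuming the standard convention for red-teaming benchmarks, Attack Success Rate is the fraction of adversarial prompts that elicit a policy-violating response, so the claimed value of 41 most likely denotes 41%:

```latex
% Standard ASR definition (an assumption; the page itself does not define the metric):
% the share of attack prompts judged successful, commonly reported as a percentage.
\mathrm{ASR} = \frac{\#\{\text{successful attack prompts}\}}{\#\{\text{total attack prompts}\}} \times 100\%
```

The empty Verified cell reflects the Unverified status: no independently reproduced value has been recorded against the claim.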