SOTAVerified

Red Teaming

Papers

Showing 141-150 of 251 papers

| Title | Status | Hype |
|---|---|---|
| WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models | Code | 2 |
| The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm | | 0 |
| CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference | Code | 1 |
| Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations | | 0 |
| Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Code | 0 |
| Adversaries Can Misuse Combinations of Safe Models | | 0 |
| Jailbreaking as a Reward Misspecification Problem | Code | 1 |
| Finding Safety Neurons in Large Language Models | | 0 |
| Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Code | 0 |
| Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | SUDO | Attack Success Rate | 41 | | Unverified |