SOTAVerified

Red Teaming

Papers

Showing 31-40 of 251 papers

Title | Status | Hype
Gandalf the Red: Adaptive Security for LLMs | Code | 1
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs | Code | 1
Aloe: A Family of Fine-tuned Open Healthcare LLMs | Code | 1
Explore, Establish, Exploit: Red Teaming Language Models from Scratch | Code | 1
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Code | 1
CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference | Code | 1
Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner | Code | 1
Attack Prompt Generation for Red Teaming and Defending Large Language Models | Code | 1
AI Control: Improving Safety Despite Intentional Subversion | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | - | Unverified