SOTAVerified

Red Teaming

Papers

Showing 21–30 of 251 papers

| Title | Status | Hype |
| --- | --- | --- |
| WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models | Code | 2 |
| Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | Code | 2 |
| GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs | Code | 1 |
| ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users | Code | 1 |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1 |
| DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | Code | 1 |
| Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation | Code | 1 |
| Explore, Establish, Exploit: Red Teaming Language Models from Scratch | Code | 1 |
| Gandalf the Red: Adaptive Security for LLMs | Code | 1 |
| CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference | Code | 1 |
Page 3 of 26

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | SUDO | Attack Success Rate | 41 | | Unverified |