SOTAVerified

Red Teaming

Papers

Showing 41-50 of 251 papers

Title | Status | Hype
Aloe: A Family of Fine-tuned Open Healthcare LLMs | Code | 1
Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Code | 1
DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | Code | 1
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
Explore, Establish, Exploit: Red Teaming Language Models from Scratch | Code | 1
MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming | Code | 1
Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation | Code | 1
Attack Prompt Generation for Red Teaming and Defending Large Language Models | Code | 1
AI Control: Improving Safety Despite Intentional Subversion | Code | 1
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Code | 1
Page 5 of 26

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | - | Unverified