SOTAVerified

Red Teaming

Papers

Showing 101–110 of 251 papers

Title | Status | Hype
RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages | Code | 0
Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities | Code | 0
Overriding Safety protections of Open-source Models | Code | 0
Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Code | 0
Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM | Code | 0
No Offense Taken: Eliciting Offensiveness from Language Models | Code | 0
Automated Progressive Red Teaming | Code | 0
ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Low-Perplexity Toxic Prompts | Code | 0
Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search | Code | 0
RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | - | Unverified