SOTAVerified

Red Teaming

Papers

Showing 101-125 of 251 papers

Title | Status | Hype
Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models | Code | 0
Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections | Code | 0
Red Teaming Language Models for Processing Contradictory Dialogues | Code | 0
Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Code | 0
Overriding Safety protections of Open-source Models | Code | 0
Capability-Based Scaling Laws for LLM Red-Teaming | Code | 0
Automated Progressive Red Teaming | Code | 0
Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM | Code | 0
ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Low-Perplexity Toxic Prompts | Code | 0
No Offense Taken: Eliciting Offensiveness from Language Models | Code | 0
RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages | Code | 0
Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges |  | 0
CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring |  | 0
JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing |  | 0
Conversational Complexity for Assessing Risk in Large Language Models |  | 0
A Safe Harbor for AI Evaluation and Red Teaming |  | 0
Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency |  | 0
Jailbreaking Large Language Models with Symbolic Mathematics |  | 0
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters |  | 0
Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts |  | 0
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming |  | 0
KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs |  | 0
JAB: Joint Adversarial Prompting and Belief Augmentation |  | 0
Computational Red Teaming in a Sudoku Solving Context: Neural Network Based Skill Representation and Acquisition |  | 0
IterAlign: Iterative Constitutional Alignment of Large Language Models |  | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 |  | Unverified