SOTAVerified

Red Teaming

Papers

Showing 101-125 of 251 papers

| Title | Status | Hype |
| --- | --- | --- |
| RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming | Code | 0 |
| Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Code | 0 |
| Red Teaming Language Models for Processing Contradictory Dialogues | Code | 0 |
| Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Code | 0 |
| Overriding Safety protections of Open-source Models | Code | 0 |
| Automated Progressive Red Teaming | Code | 0 |
| Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM | Code | 0 |
| ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Low-Perplexity Toxic Prompts | Code | 0 |
| No Offense Taken: Eliciting Offensiveness from Language Models | Code | 0 |
| RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages | Code | 0 |
| Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges | | 0 |
| CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring | | 0 |
| JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing | | 0 |
| Conversational Complexity for Assessing Risk in Large Language Models | | 0 |
| A Safe Harbor for AI Evaluation and Red Teaming | | 0 |
| Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency | | 0 |
| Jailbreaking Large Language Models with Symbolic Mathematics | | 0 |
| Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters | | 0 |
| Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | | 0 |
| Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming | | 0 |
| JAB: Joint Adversarial Prompting and Belief Augmentation | | 0 |
| KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs | | 0 |
| Computational Red Teaming in a Sudoku Solving Context: Neural Network Based Skill Representation and Acquisition | | 0 |
| IterAlign: Iterative Constitutional Alignment of Large Language Models | | 0 |
| Investigating Bias Representations in Llama 2 Chat via Activation Steering | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | SUDO | Attack Success Rate | 41 | | Unverified |