SOTAVerified

Red Teaming

Papers

Showing 51-100 of 251 papers

Title | Status | Hype
MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming | Code | 1
Understanding and Enhancing the Transferability of Jailbreaking Attacks | Code | 1
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Code | 1
Causality Analysis for Evaluating the Security of Large Language Models | Code | 1
Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits | Code | 1
Learning diverse attacks on large language models for robust red-teaming and safety tuning | Code | 1
Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner | Code | 1
Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases | Code | 1
"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak | Code | 1
CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference | Code | 1
Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Code | 1
DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | Code | 1
AI Control: Improving Safety Despite Intentional Subversion | Code | 1
Jailbroken: How Does LLM Safety Training Fail? | Code | 1
Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training | Code | 1
OET: Optimization-based prompt injection Evaluation Toolkit | Code | 1
Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction | Code | 1
Jailbreaking as a Reward Misspecification Problem | Code | 1
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment | Code | 1
Attack Prompt Generation for Red Teaming and Defending Large Language Models | Code | 1
UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning | Code | 1
Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Code | 0
ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models | Code | 0
Distract Large Language Models for Automatic Jailbreak Attack | Code | 0
Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections | Code | 0
SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Code | 0
Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models | Code | 0
Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Code | 0
SAGE: A Generic Framework for LLM Safety Evaluation | Code | 0
Capability-Based Scaling Laws for LLM Red-Teaming | Code | 0
SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis | Code | 0
RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts | Code | 0
An Auditing Test To Detect Behavioral Shift in Language Models | Code | 0
Red-Teaming Segment Anything Model | Code | 0
Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities | Code | 0
The Structural Safety Generalization Problem | Code | 0
BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | Code | 0
Bias patterns in the application of LLMs for clinical decision support: A comprehensive study | Code | 0
Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? | Code | 0
Red Teaming Language Models for Processing Contradictory Dialogues | Code | 0
Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety | Code | 0
RedDebate: Safer Responses through Multi-Agent Red Teaming Debates | Code | 0
RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming | Code | 0
Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks | Code | 0
RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages | Code | 0
Overriding Safety protections of Open-source Models | Code | 0
Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models | Code | 0
Aligners: Decoupling LLMs and Alignment | Code | 0
BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models | Code | 0
Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | | Unverified