SOTAVerified

Red Teaming

Papers

Showing 51–100 of 251 papers

| Title | Status | Hype |
|---|---|---|
| Understanding and Enhancing the Transferability of Jailbreaking Attacks | Code | 1 |
| Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner | Code | 1 |
| Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Code | 1 |
| Causality Analysis for Evaluating the Security of Large Language Models | Code | 1 |
| Learning diverse attacks on large language models for robust red-teaming and safety tuning | Code | 1 |
| Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits | Code | 1 |
| ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users | Code | 1 |
| Control Risk for Potential Misuse of Artificial Intelligence in Science | Code | 1 |
| Jailbreaking as a Reward Misspecification Problem | Code | 1 |
| CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference | Code | 1 |
| Red Teaming Language Models with Language Models | Code | 1 |
| Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training | Code | 1 |
| Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction | Code | 1 |
| Jailbroken: How Does LLM Safety Training Fail? | Code | 1 |
| "Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak | Code | 1 |
| UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning | Code | 1 |
| AI Control: Improving Safety Despite Intentional Subversion | Code | 1 |
| XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models | Code | 1 |
| Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Code | 1 |
| Attack Prompt Generation for Red Teaming and Defending Large Language Models | Code | 1 |
| Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts | Code | 1 |
| Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Code | 0 |
| ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models | Code | 0 |
| Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Code | 0 |
| Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models | Code | 0 |
| Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities | Code | 0 |
| SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Code | 0 |
| Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections | Code | 0 |
| Distract Large Language Models for Automatic Jailbreak Attack | Code | 0 |
| Capability-Based Scaling Laws for LLM Red-Teaming | Code | 0 |
| An Auditing Test To Detect Behavioral Shift in Language Models | Code | 0 |
| RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts | Code | 0 |
| Red-Teaming Segment Anything Model | Code | 0 |
| SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis | Code | 0 |
| SAGE: A Generic Framework for LLM Safety Evaluation | Code | 0 |
| Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? | Code | 0 |
| Bias patterns in the application of LLMs for clinical decision support: A comprehensive study | Code | 0 |
| Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks | Code | 0 |
| Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety | Code | 0 |
| RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming | Code | 0 |
| RedDebate: Safer Responses through Multi-Agent Red Teaming Debates | Code | 0 |
| Red Teaming Language Models for Processing Contradictory Dialogues | Code | 0 |
| BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | Code | 0 |
| Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models | Code | 0 |
| Overriding Safety protections of Open-source Models | Code | 0 |
| Aligners: Decoupling LLMs and Alignment | Code | 0 |
| BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models | Code | 0 |
| RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages | Code | 0 |
| Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM | Code | 0 |
| Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Code | 0 |
Page 2 of 6

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | SUDO | Attack Success Rate | 41 | | Unverified |