SOTAVerified

Red Teaming

Papers

Showing 1-50 of 251 papers

Title | Status | Hype
garak: A Framework for Security Probing Large Language Models | Code | 9
PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI System | Code | 7
Seamless: Multilingual Expressive and Streaming Speech Translation | Code | 6
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal | Code | 4
AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases | Code | 3
AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs | Code | 3
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned | Code | 3
Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt | Code | 2
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts | Code | 2
Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast | Code | 2
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | Code | 2
Against The Achilles' Heel: A Survey on Red Teaming for Generative Models | Code | 2
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Code | 2
Curiosity-driven Red-teaming for Large Language Models | Code | 2
Tamper-Resistant Safeguards for Open-Weight LLMs | Code | 2
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Code | 2
Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation | Code | 2
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models | Code | 2
ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming | Code | 2
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs | Code | 2
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models | Code | 2
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | Code | 2
Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts | Code | 1
ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users | Code | 1
Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction | Code | 1
Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation | Code | 1
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs | Code | 1
Jailbroken: How Does LLM Safety Training Fail? | Code | 1
Query-Efficient Black-Box Red Teaming via Bayesian Optimization | Code | 1
Jailbreaking as a Reward Misspecification Problem | Code | 1
OET: Optimization-based prompt injection Evaluation Toolkit | Code | 1
MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming | Code | 1
Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) | Code | 1
Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Code | 1
Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner | Code | 1
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models | Code | 1
PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage | Code | 1
CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference | Code | 1
Control Risk for Potential Misuse of Artificial Intelligence in Science | Code | 1
Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases | Code | 1
Aloe: A Family of Fine-tuned Open Healthcare LLMs | Code | 1
Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Code | 1
DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | Code | 1
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
Explore, Establish, Exploit: Red Teaming Language Models from Scratch | Code | 1
"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak | Code | 1
Large Language Model Unlearning | Code | 1
Attack Prompt Generation for Red Teaming and Defending Large Language Models | Code | 1
AI Control: Improving Safety Despite Intentional Subversion | Code | 1
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | | Unverified