SOTAVerified

Red Teaming

Papers

Showing 26-50 of 251 papers

Title | Status | Hype
"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak | Code | 1
OET: Optimization-based prompt injection Evaluation Toolkit | Code | 1
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models | Code | 1
Learning diverse attacks on large language models for robust red-teaming and safety tuning | Code | 1
Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo | Code | 1
MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming | Code | 1
Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) | Code | 1
Query-Efficient Black-Box Red Teaming via Bayesian Optimization | Code | 1
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Code | 1
Causality Analysis for Evaluating the Security of Large Language Models | Code | 1
Jailbroken: How Does LLM Safety Training Fail? | Code | 1
Aloe: A Family of Fine-tuned Open Healthcare LLMs | Code | 1
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs | Code | 1
Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases | Code | 1
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | Code | 1
Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Code | 1
Explore, Establish, Exploit: Red Teaming Language Models from Scratch | Code | 1
Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction | Code | 1
Attack Prompt Generation for Red Teaming and Defending Large Language Models | Code | 1
Jailbreaking as a Reward Misspecification Problem | Code | 1
Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation | Code | 1
Control Risk for Potential Misuse of Artificial Intelligence in Science | Code | 1
AI Control: Improving Safety Despite Intentional Subversion | Code | 1
Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 |  | Unverified