SOTAVerified

Red Teaming

Papers

Showing 51-75 of 251 papers

Title | Status | Hype
Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner | Code | 1
Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo | Code | 1
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Code | 1
Causality Analysis for Evaluating the Security of Large Language Models | Code | 1
Red Teaming Language Models with Language Models | Code | 1
Gandalf the Red: Adaptive Security for LLMs | Code | 1
ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users | Code | 1
Control Risk for Potential Misuse of Artificial Intelligence in Science | Code | 1
AI Control: Improving Safety Despite Intentional Subversion | Code | 1
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models | Code | 1
RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking | Code | 1
Explore, Establish, Exploit: Red Teaming Language Models from Scratch | Code | 1
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs | Code | 1
DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | Code | 1
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
Jailbreaking as a Reward Misspecification Problem | Code | 1
Red Teaming Language Model Detectors with Language Models | Code | 1
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment | Code | 1
Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Code | 1
Attack Prompt Generation for Red Teaming and Defending Large Language Models | Code | 1
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs | Code | 1
Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Code | 0
ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models | Code | 0
RedDebate: Safer Responses through Multi-Agent Red Teaming Debates | Code | 0
RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming | Code | 0
Page 3 of 11

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | - | Unverified