SOTAVerified

Red Teaming

Papers

Showing 51–75 of 251 papers

| Title | Status | Hype |
| --- | --- | --- |
| Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs | Code | 1 |
| Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) | Code | 1 |
| Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Code | 1 |
| Causality Analysis for Evaluating the Security of Large Language Models | Code | 1 |
| Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction | Code | 1 |
| Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner | Code | 1 |
| Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment | Code | 1 |
| Control Risk for Potential Misuse of Artificial Intelligence in Science | Code | 1 |
| Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents | Code | 1 |
| Gandalf the Red: Adaptive Security for LLMs | Code | 1 |
| RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking | Code | 1 |
| AI Control: Improving Safety Despite Intentional Subversion | Code | 1 |
| Explore, Establish, Exploit: Red Teaming Language Models from Scratch | Code | 1 |
| GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs | Code | 1 |
| DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | Code | 1 |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1 |
| Red Teaming Language Model Detectors with Language Models | Code | 1 |
| Red Teaming Language Models with Language Models | Code | 1 |
| Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Code | 1 |
| Attack Prompt Generation for Red Teaming and Defending Large Language Models | Code | 1 |
| Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training | Code | 1 |
| Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Code | 0 |
| ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models | Code | 0 |
| RedDebate: Safer Responses through Multi-Agent Red Teaming Debates | Code | 0 |
| RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages | Code | 0 |
Page 3 of 11

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | SUDO | Attack Success Rate | 41 | — | Unverified |
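
Attack Success Rate (ASR) is the standard red-teaming metric: the fraction of adversarial prompts that elicit a policy-violating response. A minimal sketch of how a claimed figure like the 41 above is typically computed, assuming per-prompt success labels from some judge; all names here are hypothetical, not this site's verification code:

```python
# Minimal ASR sketch: assumes each adversarial prompt's response has
# already been labeled jailbroken (True) or refused (False) by a judge.

def attack_success_rate(outcomes: list[bool]) -> float:
    """Percentage of adversarial prompts that elicited a harmful response."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)

# Example: 41 successes out of 100 attempts yields an ASR of 41.0,
# matching the claimed (but unverified) value in the table above.
print(attack_success_rate([True] * 41 + [False] * 59))  # 41.0
```

Verification would rerun the attack under the paper's stated setup and compare the reproduced ASR against the claimed one; here the Verified column is still empty, hence the Unverified status.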