SOTAVerified

Red Teaming

Papers

Showing 51–75 of 251 papers

| Title | Status | Hype |
| --- | --- | --- |
| "Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak | Code | 1 |
| Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) | Code | 1 |
| Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Code | 1 |
| Causality Analysis for Evaluating the Security of Large Language Models | Code | 1 |
| RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments | Code | 1 |
| Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training | Code | 1 |
| ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users | Code | 1 |
| Control Risk for Potential Misuse of Artificial Intelligence in Science | Code | 1 |
| Large Language Model Unlearning | Code | 1 |
| Jailbroken: How Does LLM Safety Training Fail? | Code | 1 |
| DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | Code | 1 |
| Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases | Code | 1 |
| Learning diverse attacks on large language models for robust red-teaming and safety tuning | Code | 1 |
| Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Code | 1 |
| AI Control: Improving Safety Despite Intentional Subversion | Code | 1 |
| Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction | Code | 1 |
| Gandalf the Red: Adaptive Security for LLMs | Code | 1 |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1 |
| GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs | Code | 1 |
| Attack Prompt Generation for Red Teaming and Defending Large Language Models | Code | 1 |
| Red Teaming Language Model Detectors with Language Models | Code | 1 |
| A Safe Harbor for AI Evaluation and Red Teaming | | 0 |
| CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring | | 0 |
| Adversaries Can Misuse Combinations of Safe Models | | 0 |
| Conversational Complexity for Assessing Risk in Large Language Models | | 0 |
Page 3 of 11

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | SUDO | Attack Success Rate | 41 | | Unverified |