SOTAVerified

Red Teaming

Papers

Showing 26–50 of 251 papers

Title | Status | Hype
RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search | Code | 1
sudo rm -rf agentic_security | Code | 1
Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training | Code | 1
UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning | Code | 1
Understanding and Enhancing the Transferability of Jailbreaking Attacks | Code | 1
Siren: A Learning-Based Multi-Turn Attack Framework for Simulating Real-World Human Jailbreak Behaviors | Code | 1
Gandalf the Red: Adaptive Security for LLMs | Code | 1
PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage | Code | 1
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs | Code | 1
Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents | Code | 1
RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking | Code | 1
Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction | Code | 1
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models | Code | 1
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs | Code | 1
Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) | Code | 1
CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference | Code | 1
Jailbreaking as a Reward Misspecification Problem | Code | 1
Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner | Code | 1
"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak | Code | 1
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models | Code | 1
Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits | Code | 1
DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | Code | 1
Learning diverse attacks on large language models for robust red-teaming and safety tuning | Code | 1
ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users | Code | 1
Page 2 of 11

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | n/a | Unverified