SOTAVerified

Red Teaming

Papers

Showing 51–75 of 251 papers

| Title | Status | Hype |
| --- | --- | --- |
| Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training | Code | 1 |
| AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration | | 0 |
| MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models | | 0 |
| Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization | | 0 |
| A Framework for Evaluating Emerging Cyberattack Capabilities of AI | | 0 |
| Red Teaming Contemporary AI Models: Insights from Spanish and Basque Perspectives | | 0 |
| JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing | | 0 |
| MAD-MAX: Modular And Diverse Malicious Attack MiXtures for Automated LLM Red Teaming | | 0 |
| Reinforced Diffuser for Red Teaming Large Vision-Language Models | | 0 |
| Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges | | 0 |
| LLM-Safety Evaluations Lack Robustness | | 0 |
| Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models | | 0 |
| UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning | Code | 1 |
| Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming | | 0 |
| Fast Proxies for LLM Robustness Evaluation | | 0 |
| A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management | | 0 |
| Predictive Red Teaming: Breaking Policies Without Breaking Robots | | 0 |
| KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs | | 0 |
| Understanding and Enhancing the Transferability of Jailbreaking Attacks | Code | 1 |
| Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming | | 0 |
| RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts | Code | 0 |
| Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation | Code | 2 |
| Siren: A Learning-Based Multi-Turn Attack Framework for Simulating Real-World Human Jailbreak Behaviors | Code | 1 |
| Playing Devil's Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models | | 0 |
| Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints | | 0 |
Page 3 of 11

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | SUDO | Attack Success Rate | 41 | | Unverified |