SOTAVerified

Red Teaming

Papers

Showing 101–125 of 251 papers

Title | Status | Hype
Offensive Security for AI Systems: Concepts, Practices, and Applications | - | 0
AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents | - | 0
Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods | - | 0
DMRL: Data- and Model-aware Reward Learning for Data Extraction | - | 0
Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs | - | 0
Red Teaming Large Language Models for Healthcare | - | 0
When Testing AI Tests Us: Safeguarding Mental Health on the Digital Frontlines | - | 0
SAGE: A Generic Framework for LLM Safety Evaluation | Code | 0
Understanding and Mitigating Risks of Generative AI in Financial Services | - | 0
RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models | - | 0
ELAB: Extensive LLM Alignment Benchmark in Persian Language | - | 0
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents | - | 0
The Structural Safety Generalization Problem | Code | 0
Multi-lingual Multi-turn Automated Red Teaming for LLMs | - | 0
Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning | - | 0
Red Teaming with Artificial Intelligence-Driven Cyberattacks: A Scoping Review | - | 0
AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration | - | 0
MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models | - | 0
Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization | - | 0
A Framework for Evaluating Emerging Cyberattack Capabilities of AI | - | 0
Red Teaming Contemporary AI Models: Insights from Spanish and Basque Perspectives | - | 0
JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing | - | 0
MAD-MAX: Modular And Diverse Malicious Attack MiXtures for Automated LLM Red Teaming | - | 0
Reinforced Diffuser for Red Teaming Large Vision-Language Models | - | 0
Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges | - | 0
Page 5 of 11

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | - | Unverified