SOTAVerified

Red Teaming

Papers

Showing 26-50 of 251 papers

Title | Status | Hype
EVA: Red-Teaming GUI Agents via Evolving Indirect Prompt Injection | | 0
"Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | | 0
Hidden Ghost Hand: Unveiling Backdoor Vulnerabilities in MLLM-Powered Mobile GUI Agents | | 0
CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models | | 0
LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs | | 0
Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety | Code | 0
Offensive Security for AI Systems: Concepts, Practices, and Applications | | 0
AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents | | 0
Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods | | 0
DMRL: Data- and Model-aware Reward Learning for Data Extraction | | 0
Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs | | 0
Red Teaming Large Language Models for Healthcare | | 0
OET: Optimization-based prompt injection Evaluation Toolkit | Code | 1
When Testing AI Tests Us: Safeguarding Mental Health on the Digital Frontlines | | 0
SAGE: A Generic Framework for LLM Safety Evaluation | Code | 0
RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models | | 0
Understanding and Mitigating Risks of Generative AI in Financial Services | | 0
RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search | Code | 1
ELAB: Extensive LLM Alignment Benchmark in Persian Language | | 0
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents | | 0
The Structural Safety Generalization Problem | Code | 0
Multi-lingual Multi-turn Automated Red Teaming for LLMs | | 0
Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning | | 0
sudo rm -rf agentic_security | Code | 1
Red Teaming with Artificial Intelligence-Driven Cyberattacks: A Scoping Review | | 0
Page 2 of 11

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | | Unverified