SOTAVerified

Red Teaming

Papers

Showing 126–150 of 251 papers

Title | Status | Hype
Automating Privilege Escalation with Deep Reinforcement Learning | | 0
AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration | | 0
Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models | | 0
Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming | | 0
Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems | | 0
Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models | | 0
Can Language Models be Instructed to Protect Personal Information? | | 0
Can Large Language Models Automatically Jailbreak GPT-4V? | | 0
Can Large Language Models Change User Preference Adversarially? | | 0
CELL your Model: Contrastive Explanations for Large Language Models | | 0
Computational Red Teaming in a Sudoku Solving Context: Neural Network Based Skill Representation and Acquisition | | 0
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming | | 0
Conversational Complexity for Assessing Risk in Large Language Models | | 0
CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring | | 0
CTI4AI: Threat Intelligence Generation and Sharing after Red Teaming AI Models | | 0
CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge | | 0
CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models | | 0
DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions | | 0
Desert Camels and Oil Sheikhs: Arab-Centric Red Teaming of Frontier LLMs | | 0
Atoxia: Red-teaming Large Language Models with Target Toxic Answers | | 0
DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization | | 0
Digital cloning of online social networks for language-sensitive agent-based modeling of misinformation spread | | 0
Direct Unlearning Optimization for Robust and Safe Text-to-Image Models | | 0
Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning | | 0
DMRL: Data- and Model-aware Reward Learning for Data Extraction | | 0
Page 6 of 11

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | | Unverified