SOTAVerified

Red Teaming

Papers

Showing 101-150 of 251 papers

Title | Status | Hype
Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations | | 0
RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent | | 0
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents | | 0
LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs | | 0
GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models | | 0
AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications | | 0
Adversaries Can Misuse Combinations of Safe Models | | 0
AdvAgent: Controllable Blackbox Red-teaming on Web Agents | | 0
A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation | | 0
A Framework for Evaluating Emerging Cyberattack Capabilities of AI | | 0
A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management | | 0
AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents | | 0
AI red-teaming is a sociotechnical challenge: on values, labor, and harms | | 0
A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI | | 0
A Multi-Disciplinary Review of Knowledge Acquisition Methods: From Human to Autonomous Eliciting Agents | | 0
A Red Teaming Framework for Securing AI in Maritime Autonomous Systems | | 0
A Red Teaming Roadmap Towards System-Level Safety | | 0
A Reward-driven Automated Webshell Malicious-code Generator for Red-teaming | | 0
Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts | | 0
A Safe Harbor for AI Evaluation and Red Teaming | | 0
Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI | | 0
AttackGNN: Red-Teaming GNNs in Hardware Security Using Reinforcement Learning | | 0
Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code | | 0
Automated Red Teaming with GOAT: the Generative Offensive Agent Tester | | 0
Automating Privilege Escalation with Deep Reinforcement Learning | | 0
AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration | | 0
Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models | | 0
Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming | | 0
Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems | | 0
Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models | | 0
Can Language Models be Instructed to Protect Personal Information? | | 0
Can Large Language Models Automatically Jailbreak GPT-4V? | | 0
Can Large Language Models Change User Preference Adversarially? | | 0
CELL your Model: Contrastive Explanations for Large Language Models | | 0
Computational Red Teaming in a Sudoku Solving Context: Neural Network Based Skill Representation and Acquisition | | 0
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming | | 0
Conversational Complexity for Assessing Risk in Large Language Models | | 0
CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring | | 0
CTI4AI: Threat Intelligence Generation and Sharing after Red Teaming AI Models | | 0
CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge | | 0
CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models | | 0
DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions | | 0
Desert Camels and Oil Sheikhs: Arab-Centric Red Teaming of Frontier LLMs | | 0
Atoxia: Red-teaming Large Language Models with Target Toxic Answers | | 0
DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization | | 0
Digital cloning of online social networks for language-sensitive agent-based modeling of misinformation spread | | 0
Direct Unlearning Optimization for Robust and Safe Text-to-Image Models | | 0
Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning | | 0
DMRL: Data- and Model-aware Reward Learning for Data Extraction | | 0
Effective Red-Teaming of Policy-Adherent Agents | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | | Unverified