SOTAVerified

Red Teaming

Papers

Showing 126–150 of 251 papers

Title | Status | Hype
CTI4AI: Threat Intelligence Generation and Sharing after Red Teaming AI Models |  | 0
CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring |  | 0
Conversational Complexity for Assessing Risk in Large Language Models |  | 0
Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models |  | 0
Lessons From Red Teaming 100 Generative AI Products |  | 0
Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations |  | 0
LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded" |  | 0
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming |  | 0
LLM-Safety Evaluations Lack Robustness |  | 0
LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs |  | 0
The Human Factor in AI Red Teaming: Perspectives from Social and Collaborative Computing |  | 0
LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B |  | 0
Low-Resource Languages Jailbreak GPT-4 |  | 0
MAD-MAX: Modular And Diverse Malicious Attack MiXtures for Automated LLM Red Teaming |  | 0
Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization |  | 0
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming |  | 0
Computational Red Teaming in a Sudoku Solving Context: Neural Network Based Skill Representation and Acquisition |  | 0
MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models |  | 0
Model Card and Evaluations for Claude Models |  | 0
CELL your Model: Contrastive Explanations for Large Language Models |  | 0
Multi-lingual Multi-turn Automated Red Teaming for LLMs |  | 0
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm |  | 0
Can Large Language Models Change User Preference Adversarially? |  | 0
Can Large Language Models Automatically Jailbreak GPT-4V? |  | 0
Offensive Security for AI Systems: Concepts, Practices, and Applications |  | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 |  | Unverified