SOTAVerified

Red Teaming

Papers

Showing 51–100 of 251 papers

Title | Status | Hype
Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts | Code | 1
RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search | Code | 1
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Code | 1
Causality Analysis for Evaluating the Security of Large Language Models | Code | 1
PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage | Code | 1
"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak | Code | 1
Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo | Code | 1
Control Risk for Potential Misuse of Artificial Intelligence in Science | Code | 1
RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking | Code | 1
Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Code | 1
Learning diverse attacks on large language models for robust red-teaming and safety tuning | Code | 1
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models | Code | 1
Jailbroken: How Does LLM Safety Training Fail? | Code | 1
Jailbreaking as a Reward Misspecification Problem | Code | 1
Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases | Code | 1
MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming | Code | 1
AI Control: Improving Safety Despite Intentional Subversion | Code | 1
Large Language Model Unlearning | Code | 1
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments | Code | 1
Attack Prompt Generation for Red Teaming and Defending Large Language Models | Code | 1
Gandalf the Red: Adaptive Security for LLMs | Code | 1
A Safe Harbor for AI Evaluation and Red Teaming | — | 0
CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring | — | 0
Adversaries Can Misuse Combinations of Safe Models | — | 0
Conversational Complexity for Assessing Risk in Large Language Models | — | 0
Investigating Bias Representations in Llama 2 Chat via Activation Steering | — | 0
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming | — | 0
Computational Red Teaming in a Sudoku Solving Context: Neural Network Based Skill Representation and Acquisition | — | 0
CELL your Model: Contrastive Explanations for Large Language Models | — | 0
Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts | — | 0
IterAlign: Iterative Constitutional Alignment of Large Language Models | — | 0
A Reward-driven Automated Webshell Malicious-code Generator for Red-teaming | — | 0
Can Large Language Models Change User Preference Adversarially? | — | 0
A Red Teaming Roadmap Towards System-Level Safety | — | 0
GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models | — | 0
Can Large Language Models Automatically Jailbreak GPT-4V? | — | 0
Can Language Models be Instructed to Protect Personal Information? | — | 0
A Red Teaming Framework for Securing AI in Maritime Autonomous Systems | — | 0
Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models | — | 0
Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems | — | 0
A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management | — | 0
Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis | — | 0
JAB: Joint Adversarial Prompting and Belief Augmentation | — | 0
Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols | — | 0
LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs | — | 0
FLIRT: Feedback Loop In-context Red Teaming | — | 0
GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization | — | 0
A Multi-Disciplinary Review of Knowledge Acquisition Methods: From Human to Autonomous Eliciting Agents | — | 0
Finding Safety Neurons in Large Language Models | — | 0
A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI | — | 0
Page 2 of 6

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | — | Unverified