SOTAVerified

Red Teaming

Papers

Showing 101–150 of 251 papers

| Title | Status | Hype |
| --- | --- | --- |
| SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Code | 0 |
| Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models | Code | 0 |
| Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections | Code | 0 |
| Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Code | 0 |
| Distract Large Language Models for Automatic Jailbreak Attack | Code | 0 |
| SAGE: A Generic Framework for LLM Safety Evaluation | Code | 0 |
| The Structural Safety Generalization Problem | Code | 0 |
| TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis | Code | 0 |
| We Should Identify and Mitigate Third-Party Safety Risks in MCP-Powered Agent Systems | Code | 0 |
| What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing | Code | 0 |
| Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents | Code | 0 |
| Investigating Bias Representations in Llama 2 Chat via Activation Steering | — | 0 |
| IterAlign: Iterative Constitutional Alignment of Large Language Models | — | 0 |
| JAB: Joint Adversarial Prompting and Belief Augmentation | — | 0 |
| DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions | — | 0 |
| Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | — | 0 |
| Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters | — | 0 |
| Jailbreaking Large Language Models with Symbolic Mathematics | — | 0 |
| Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency | — | 0 |
| CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models | — | 0 |
| CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge | — | 0 |
| JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing | — | 0 |
| KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs | — | 0 |
| Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges | — | 0 |
| LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs | — | 0 |
| CTI4AI: Threat Intelligence Generation and Sharing after Red Teaming AI Models | — | 0 |
| CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring | — | 0 |
| Conversational Complexity for Assessing Risk in Large Language Models | — | 0 |
| Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models | — | 0 |
| Lessons From Red Teaming 100 Generative AI Products | — | 0 |
| Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations | — | 0 |
| LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded" | — | 0 |
| Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming | — | 0 |
| LLM-Safety Evaluations Lack Robustness | — | 0 |
| LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs | — | 0 |
| The Human Factor in AI Red Teaming: Perspectives from Social and Collaborative Computing | — | 0 |
| LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B | — | 0 |
| Low-Resource Languages Jailbreak GPT-4 | — | 0 |
| MAD-MAX: Modular And Diverse Malicious Attack MiXtures for Automated LLM Red Teaming | — | 0 |
| Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization | — | 0 |
| MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | — | 0 |
| Computational Red Teaming in a Sudoku Solving Context: Neural Network Based Skill Representation and Acquisition | — | 0 |
| MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models | — | 0 |
| Model Card and Evaluations for Claude Models | — | 0 |
| CELL your Model: Contrastive Explanations for Large Language Models | — | 0 |
| Multi-lingual Multi-turn Automated Red Teaming for LLMs | — | 0 |
| The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm | — | 0 |
| Can Large Language Models Change User Preference Adversarially? | — | 0 |
| Can Large Language Models Automatically Jailbreak GPT-4V? | — | 0 |
| Offensive Security for AI Systems: Concepts, Practices, and Applications | — | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | SUDO | Attack Success Rate | 41 | — | Unverified |