SOTAVerified

Red Teaming

Papers

Showing 126–150 of 251 papers

| Title | Status | Hype |
| --- | --- | --- |
| Can Large Language Models Automatically Jailbreak GPT-4V? | | 0 |
| RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent | | 0 |
| Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs | Code | 1 |
| Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems | | 0 |
| Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts | | 0 |
| Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) | Code | 1 |
| Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle | | 0 |
| Direct Unlearning Optimization for Robust and Safe Text-to-Image Models | | 0 |
| AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases | Code | 3 |
| Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | Code | 2 |
| ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Low-Perplexity Toxic Prompts | Code | 0 |
| The Human Factor in AI Red Teaming: Perspectives from Social and Collaborative Computing | | 0 |
| Automated Progressive Red Teaming | Code | 0 |
| SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Code | 0 |
| Purple-teaming LLMs with Adversarial Defender Training | | 0 |
| WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models | Code | 2 |
| The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm | | 0 |
| CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference | Code | 1 |
| Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations | | 0 |
| Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Code | 0 |
| Adversaries Can Misuse Combinations of Safe Models | | 0 |
| Jailbreaking as a Reward Misspecification Problem | Code | 1 |
| Finding Safety Neurons in Large Language Models | | 0 |
| Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Code | 0 |
| Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner | Code | 1 |
Page 6 of 11

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | SUDO | Attack Success Rate | 41 | | Unverified |