SOTAVerified

Red Teaming

Papers

Showing 176-200 of 251 papers

| Title | Status | Hype |
| --- | --- | --- |
| Can Large Language Models Automatically Jailbreak GPT-4V? | | 0 |
| Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems | | 0 |
| Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts | | 0 |
| Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle | | 0 |
| Direct Unlearning Optimization for Robust and Safe Text-to-Image Models | | 0 |
| ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Low-Perplexity Toxic Prompts | Code | 0 |
| The Human Factor in AI Red Teaming: Perspectives from Social and Collaborative Computing | | 0 |
| Automated Progressive Red Teaming | Code | 0 |
| SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Code | 0 |
| Purple-teaming LLMs with Adversarial Defender Training | | 0 |
| The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm | | 0 |
| Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations | | 0 |
| Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Code | 0 |
| Finding Safety Neurons in Large Language Models | | 0 |
| Adversaries Can Misuse Combinations of Safe Models | | 0 |
| Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming | | 0 |
| Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Code | 0 |
| CELL your Model: Contrastive Explanations for Large Language Models | | 0 |
| STAR: SocioTechnical Approach to Red Teaming Language Models | | 0 |
| Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters | | 0 |
| Safety Alignment for Vision Language Models | | 0 |
| Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming | | 0 |
| Red Teaming Language Models for Processing Contradictory Dialogues | Code | 0 |
| A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI | | 0 |
| Bias patterns in the application of LLMs for clinical decision support: A comprehensive study | Code | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | SUDO | Attack Success Rate | 41 | | Unverified |