SOTAVerified

Red Teaming

Papers

Showing 151–175 of 251 papers

Title | Status | Hype
Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming | - | 0
"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak | Code | 1
STAR: SocioTechnical Approach to Red Teaming Language Models | - | 0
CELL your Model: Contrastive Explanations for Large Language Models | - | 0
garak: A Framework for Security Probing Large Language Models | Code | 9
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models | Code | 1
Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt | Code | 2
Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits | Code | 1
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models | Code | 2
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters | - | 0
DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | Code | 1
Learning diverse attacks on large language models for robust red-teaming and safety tuning | Code | 1
ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users | Code | 1
Safety Alignment for Vision Language Models | - | 0
Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming | - | 0
Red Teaming Language Models for Processing Contradictory Dialogues | Code | 0
Aloe: A Family of Fine-tuned Open Healthcare LLMs | Code | 1
Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo | Code | 1
Bias patterns in the application of LLMs for clinical decision support: A comprehensive study | Code | 0
A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI | - | 0
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs | Code | 2
CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge | - | 0
ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming | Code | 2
Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? | Code | 0
Red-Teaming Segment Anything Model | Code | 0
Page 7 of 11

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | - | Unverified