SOTAVerified

Red Teaming

Papers

Showing 151–200 of 251 papers

Title | Status | Hype
Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming | | 0
"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak | Code | 1
STAR: SocioTechnical Approach to Red Teaming Language Models | | 0
CELL your Model: Contrastive Explanations for Large Language Models | | 0
garak: A Framework for Security Probing Large Language Models | Code | 9
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models | Code | 1
Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt | Code | 2
Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits | Code | 1
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models | Code | 2
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters | | 0
DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | Code | 1
Learning diverse attacks on large language models for robust red-teaming and safety tuning | Code | 1
ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users | Code | 1
Safety Alignment for Vision Language Models | | 0
Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming | | 0
Red Teaming Language Models for Processing Contradictory Dialogues | Code | 0
Aloe: A Family of Fine-tuned Open Healthcare LLMs | Code | 1
Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo | Code | 1
Bias patterns in the application of LLMs for clinical decision support: A comprehensive study | Code | 0
A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI | | 0
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs | Code | 2
CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge | | 0
ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming | Code | 2
Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? | Code | 0
Red-Teaming Segment Anything Model | Code | 0
Against The Achilles' Heel: A Survey on Red Teaming for Generative Models | Code | 2
Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code | | 0
IterAlign: Iterative Constitutional Alignment of Large Language Models | | 0
HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback | | 0
Distract Large Language Models for Automatic Jailbreak Attack | Code | 0
Red Teaming Models for Hyperspectral Image Analysis Using Explainable AI | | 0
Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Code | 1
Aligners: Decoupling LLMs and Alignment | Code | 0
A Safe Harbor for AI Evaluation and Red Teaming | | 0
Curiosity-driven Red-teaming for Large Language Models | Code | 2
AttackGNN: Red-Teaming GNNs in Hardware Security Using Reinforcement Learning | | 0
Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation | Code | 1
Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast | Code | 2
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal | Code | 4
Investigating Bias Representations in Llama 2 Chat via Activation Steering | | 0
Gradient-Based Language Model Red Teaming | Code | 0
Towards Red Teaming in Multimodal and Multilingual Translation | | 0
Red-Teaming for Generative AI: Silver Bullet or Security Theater? | | 0
Digital cloning of online social networks for language-sensitive agent-based modeling of misinformation spread | | 0
Red Teaming Visual Language Models | | 0
Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models | Code | 0
Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks | Code | 0
Causality Analysis for Evaluating the Security of Large Language Models | Code | 1
AI Control: Improving Safety Despite Intentional Subversion | Code | 1
Control Risk for Potential Misuse of Artificial Intelligence in Science | Code | 1
Page 4 of 6

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | | Unverified