SOTAVerified

Red Teaming

Papers

Showing 176-200 of 251 papers

Title | Status | Hype
Against The Achilles' Heel: A Survey on Red Teaming for Generative Models | Code | 2
Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code |  | 0
IterAlign: Iterative Constitutional Alignment of Large Language Models |  | 0
HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback |  | 0
Distract Large Language Models for Automatic Jailbreak Attack | Code | 0
Red Teaming Models for Hyperspectral Image Analysis Using Explainable AI |  | 0
Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Code | 1
Aligners: Decoupling LLMs and Alignment | Code | 0
A Safe Harbor for AI Evaluation and Red Teaming |  | 0
Curiosity-driven Red-teaming for Large Language Models | Code | 2
AttackGNN: Red-Teaming GNNs in Hardware Security Using Reinforcement Learning |  | 0
Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation | Code | 1
Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast | Code | 2
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal | Code | 4
Investigating Bias Representations in Llama 2 Chat via Activation Steering |  | 0
Gradient-Based Language Model Red Teaming | Code | 0
Towards Red Teaming in Multimodal and Multilingual Translation |  | 0
Red-Teaming for Generative AI: Silver Bullet or Security Theater? |  | 0
Digital cloning of online social networks for language-sensitive agent-based modeling of misinformation spread |  | 0
Red Teaming Visual Language Models |  | 0
Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models | Code | 0
Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks | Code | 0
Causality Analysis for Evaluating the Security of Large Language Models | Code | 1
AI Control: Improving Safety Despite Intentional Subversion | Code | 1
Control Risk for Potential Misuse of Artificial Intelligence in Science | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 |  | Unverified