SOTAVerified

Red Teaming

Papers

Showing 101–110 of 251 papers

Title | Status | Hype
h4rm3l: A language for Composable Jailbreak Attack Synthesis | | 0
"Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | | 0
Finding Safety Neurons in Large Language Models | | 0
A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI | | 0
Fast Proxies for LLM Robustness Evaluation | | 0
HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback | | 0
Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation | | 0
Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming | | 0
A Framework for Evaluating Emerging Cyberattack Capabilities of AI | | 0
Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | | 0
Page 11 of 26

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | | Unverified