SOTAVerified

Red Teaming

Papers

Showing 101–125 of 251 papers

| Title | Status | Hype |
|---|---|---|
| h4rm3l: A language for Composable Jailbreak Attack Synthesis | | 0 |
| "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | | 0 |
| LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B | | 0 |
| MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | | 0 |
| Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols | | 0 |
| HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback | | 0 |
| LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs | | 0 |
| FLIRT: Feedback Loop In-context Red Teaming | | 0 |
| A Multi-Disciplinary Review of Knowledge Acquisition Methods: From Human to Autonomous Eliciting Agents | | 0 |
| Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis | | 0 |
| Lessons From Red Teaming 100 Generative AI Products | | 0 |
| IterAlign: Iterative Constitutional Alignment of Large Language Models | | 0 |
| JAB: Joint Adversarial Prompting and Belief Augmentation | | 0 |
| Finding Safety Neurons in Large Language Models | | 0 |
| A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI | | 0 |
| Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters | | 0 |
| Fast Proxies for LLM Robustness Evaluation | | 0 |
| Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency | | 0 |
| Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation | | 0 |
| Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming | | 0 |
| A Framework for Evaluating Emerging Cyberattack Capabilities of AI | | 0 |
| KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs | | 0 |
| Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges | | 0 |
| CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring | | 0 |
| Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | SUDO | Attack Success Rate | 41 | | Unverified |