SOTAVerified

Red Teaming

Papers

Showing 126-150 of 251 papers

Title | Status | Hype
LLM-Safety Evaluations Lack Robustness | | 0
Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models | | 0
Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming | | 0
Fast Proxies for LLM Robustness Evaluation | | 0
A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management | | 0
Predictive Red Teaming: Breaking Policies Without Breaking Robots | | 0
KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs | | 0
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming | | 0
RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts | Code | 0
Playing Devil's Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models | | 0
Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints | | 0
Lessons From Red Teaming 100 Generative AI Products | | 0
Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency | | 0
Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models | | 0
Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning | | 0
OpenAI o1 System Card | | 0
POEX: Understanding and Mitigating Policy Executable Jailbreak Attacks against Embodied AI | | 0
AI red-teaming is a sociotechnical challenge: on values, labor, and harms | | 0
Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM | Code | 0
Embodied Red Teaming for Auditing Robotic Foundation Models | | 0
In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models | | 0
LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs | | 0
Desert Camels and Oil Sheikhs: Arab-Centric Red Teaming of Frontier LLMs | | 0
Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Code | 0
An Auditing Test To Detect Behavioral Shift in Language Models | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | | Unverified