SOTAVerified

Red Teaming

Papers

Showing 76–100 of 251 papers

Title | Status | Hype
Gandalf the Red: Adaptive Security for LLMs | Code | 1
Lessons From Red Teaming 100 Generative AI Products |  | 0
Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency |  | 0
Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models |  | 0
Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning |  | 0
OpenAI o1 System Card |  | 0
POEX: Understanding and Mitigating Policy Executable Jailbreak Attacks against Embodied AI |  | 0
AI red-teaming is a sociotechnical challenge: on values, labor, and harms |  | 0
Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM | Code | 0
PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage | Code | 1
Embodied Red Teaming for Auditing Robotic Foundation Models |  | 0
In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models |  | 0
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs | Code | 1
LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs |  | 0
Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Code | 0
Desert Camels and Oil Sheikhs: Arab-Centric Red Teaming of Frontier LLMs |  | 0
An Auditing Test To Detect Behavioral Shift in Language Models | Code | 0
AdvAgent: Controllable Blackbox Red-teaming on Web Agents |  | 0
LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded" |  | 0
Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis |  | 0
SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis | Code | 0
BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models | Code | 0
A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation |  | 0
VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment |  | 0
Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 |  | Unverified