SOTAVerified

Red Teaming

Papers

Showing 151–200 of 251 papers

Title | Status | Hype
LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded" | | 0
AdvAgent: Controllable Blackbox Red-teaming on Web Agents | | 0
SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis | Code | 0
Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis | | 0
BiasJailbreak:Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models | Code | 0
A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation | | 0
VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | | 0
Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations | | 0
SteerDiff: Steering towards Safe Text-to-Image Diffusion Models | | 0
Automated Red Teaming with GOAT: the Generative Offensive Agent Tester | | 0
Overriding Safety protections of Open-source Models | Code | 0
Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI | | 0
Jailbreaking Large Language Models with Symbolic Mathematics | | 0
What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing | Code | 0
Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols | | 0
Exploring Straightforward Conversational Red-Teaming | | 0
Conversational Complexity for Assessing Risk in Large Language Models | | 0
Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness | | 0
Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models | Code | 0
Atoxia: Red-teaming Large Language Models with Target Toxic Answers | | 0
DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization | | 0
SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | | 0
Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search | Code | 0
h4rm3l: A language for Composable Jailbreak Attack Synthesis | | 0
RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent | | 0
Can Large Language Models Automatically Jailbreak GPT-4V? | | 0
Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems | | 0
Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts | | 0
Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle | | 0
Direct Unlearning Optimization for Robust and Safe Text-to-Image Models | | 0
ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Low-Perplexity Toxic Prompts | Code | 0
The Human Factor in AI Red Teaming: Perspectives from Social and Collaborative Computing | | 0
Automated Progressive Red Teaming | Code | 0
SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Code | 0
Purple-teaming LLMs with Adversarial Defender Training | | 0
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm | | 0
Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations | | 0
Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Code | 0
Finding Safety Neurons in Large Language Models | | 0
Adversaries Can Misuse Combinations of Safe Models | | 0
Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming | | 0
Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Code | 0
CELL your Model: Contrastive Explanations for Large Language Models | | 0
STAR: SocioTechnical Approach to Red Teaming Language Models | | 0
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters | | 0
Safety Alignment for Vision Language Models | | 0
Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming | | 0
Red Teaming Language Models for Processing Contradictory Dialogues | Code | 0
A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI | | 0
Bias patterns in the application of LLMs for clinical decision support: A comprehensive study | Code | 0
Page 4 of 6

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | | Unverified