SOTAVerified

Red Teaming

Papers

Showing 76–100 of 251 papers

Title | Status | Hype
Investigating Bias Representations in Llama 2 Chat via Activation Steering | | 0
Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming | | 0
Computational Red Teaming in a Sudoku Solving Context: Neural Network Based Skill Representation and Acquisition | | 0
CELL your Model: Contrastive Explanations for Large Language Models | | 0
Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts | | 0
IterAlign: Iterative Constitutional Alignment of Large Language Models | | 0
A Reward-driven Automated Webshell Malicious-code Generator for Red-teaming | | 0
Can Large Language Models Change User Preference Adversarially? | | 0
A Red Teaming Roadmap Towards System-Level Safety | | 0
GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models | | 0
Can Large Language Models Automatically Jailbreak GPT-4V? | | 0
Can Language Models be Instructed to Protect Personal Information? | | 0
A Red Teaming Framework for Securing AI in Maritime Autonomous Systems | | 0
Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models | | 0
Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems | | 0
A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management | | 0
Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis | | 0
JAB: Joint Adversarial Prompting and Belief Augmentation | | 0
Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols | | 0
LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs | | 0
FLIRT: Feedback Loop In-context Red Teaming | | 0
GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization | | 0
A Multi-Disciplinary Review of Knowledge Acquisition Methods: From Human to Autonomous Eliciting Agents | | 0
Finding Safety Neurons in Large Language Models | | 0
A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI | | 0
Page 4 of 11

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | | Unverified