SOTAVerified

Red Teaming

Papers

Showing 76-100 of 251 papers

Title | Status | Hype
Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities | Code | 0
SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Code | 0
Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections | Code | 0
Distract Large Language Models for Automatic Jailbreak Attack | Code | 0
Capability-Based Scaling Laws for LLM Red-Teaming | Code | 0
An Auditing Test To Detect Behavioral Shift in Language Models | Code | 0
RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts | Code | 0
Red-Teaming Segment Anything Model | Code | 0
SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis | Code | 0
SAGE: A Generic Framework for LLM Safety Evaluation | Code | 0
Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? | Code | 0
Bias patterns in the application of LLMs for clinical decision support: A comprehensive study | Code | 0
Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks | Code | 0
Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety | Code | 0
RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming | Code | 0
RedDebate: Safer Responses through Multi-Agent Red Teaming Debates | Code | 0
Red Teaming Language Models for Processing Contradictory Dialogues | Code | 0
BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | Code | 0
Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models | Code | 0
Overriding Safety protections of Open-source Models | Code | 0
Aligners: Decoupling LLMs and Alignment | Code | 0
BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models | Code | 0
RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages | Code | 0
Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM | Code | 0
Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | | Unverified