SOTAVerified

Red Teaming

Papers

Showing 201-250 of 251 papers

Title | Status | Hype
The Human Factor in AI Red Teaming: Perspectives from Social and Collaborative Computing | - | 0
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm | - | 0
The Promise and Peril of Artificial Intelligence -- Violet Teaming Offers a Balanced Path Forward | - | 0
Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming | - | 0
Towards medical AI misalignment: a preliminary study | - | 0
Towards Publicly Accountable Frontier LLMs: Building an External Scrutiny Ecosystem under the ASPIRE Framework | - | 0
Towards Red Teaming in Multimodal and Multilingual Translation | - | 0
Towards Secure MLOps: Surveying Attacks, Mitigation Strategies, and Research Challenges | - | 0
Understanding and Mitigating Risks of Generative AI in Financial Services | - | 0
VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | - | 0
When Testing AI Tests Us: Safeguarding Mental Health on the Digital Frontlines | - | 0
Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities | Code | 0
SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Code | 0
RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming | Code | 0
RedDebate: Safer Responses through Multi-Agent Red Teaming Debates | Code | 0
Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models | Code | 0
What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing | Code | 0
Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models | Code | 0
Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks | Code | 0
Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections | Code | 0
Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? | Code | 0
RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages | Code | 0
Red Teaming Language Models for Processing Contradictory Dialogues | Code | 0
Overriding Safety protections of Open-source Models | Code | 0
No Offense Taken: Eliciting Offensiveness from Language Models | Code | 0
Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents | Code | 0
Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM | Code | 0
Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Code | 0
Red-Teaming Segment Anything Model | Code | 0
Bias patterns in the application of LLMs for clinical decision support: A comprehensive study | Code | 0
Capability-Based Scaling Laws for LLM Red-Teaming | Code | 0
TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis | Code | 0
BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | Code | 0
Distract Large Language Models for Automatic Jailbreak Attack | Code | 0
Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search | Code | 0
Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety | Code | 0
InfoPattern: Unveiling Information Propagation Patterns in Social Media | Code | 0
RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts | Code | 0
Gradient-Based Language Model Red Teaming | Code | 0
Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Code | 0
SAGE: A Generic Framework for LLM Safety Evaluation | Code | 0
An Auditing Test To Detect Behavioral Shift in Language Models | Code | 0
ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Low-Perplexity Toxic Prompts | Code | 0
ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models | Code | 0
The Structural Safety Generalization Problem | Code | 0
BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models | Code | 0
Automated Progressive Red Teaming | Code | 0
Aligners: Decoupling LLMs and Alignment | Code | 0
We Should Identify and Mitigate Third-Party Safety Risks in MCP-Powered Agent Systems | Code | 0
Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Code | 0
Page 5 of 6

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | - | Unverified