SOTAVerified

Safety Alignment

Papers

Showing 201–225 of 288 papers

Toxic Subword Pruning for Dialogue Response Generation on Large Language Models
Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message
Trustworthy AI: Safety, Bias, and Privacy -- A Survey
Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models
TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data
Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary
Understanding and Rectifying Safety Perception Distortion in VLMs
Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models
Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs
VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization
VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap
Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning
Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing
What's Pulling the Strings? Evaluating Integrity and Attribution in AI Training and Inference through Concept Shift
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets
Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region
WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response
Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation
PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models
RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability
Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions
Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation
Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model
Page 9 of 12
