| Title | Date | Topics |
| --- | --- | --- |
| LLM Safeguard is a Double-Edged Sword: Exploiting False Positives for Denial-of-Service Attacks | Oct 3, 2024 | Adversarial Robustness, Safety Alignment |
| SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models | Apr 9, 2025 | Safety Alignment |
| SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment | May 20, 2025 | Safety Alignment |
| Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models | Mar 22, 2025 | Misinformation, Safe Reinforcement Learning |
| Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models | Nov 30, 2024 | Safety Alignment |
| Safety Alignment Can Be Not Superficial With Explicit Safety Signals | May 19, 2025 | Binary Classification, Data Augmentation |
| Safety Alignment for Vision Language Models | May 22, 2024 | Red Teaming, Safety Alignment |
| Safety Alignment via Constrained Knowledge Unlearning | May 24, 2025 | Knowledge Editing, Safety Alignment |
| SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation | Dec 13, 2024 | Image Generation, Safety Alignment |
| Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety | Mar 6, 2025 | Decision Making, Safety Alignment |
| SafeVid: Toward Safety Aligned Video Large Multimodal Models | May 17, 2025 | Safety Alignment |
| SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning | Mar 5, 2025 | Safe Reinforcement Learning, Safety Alignment |
| SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification | Jun 20, 2025 | Mixture-of-Experts, Response Generation |
| SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | Aug 14, 2024 | Red Teaming, Safety Alignment |
| SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation | Jan 3, 2025 | Parameter-Efficient Fine-Tuning, Safety Alignment |
| SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks | Oct 2, 2024 | Safety Alignment |
| Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks | Jun 23, 2025 | Mixture-of-Experts, Safety Alignment |
| SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression | Jun 15, 2025 | LLM Jailbreak, Safety Alignment |
| Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack | May 28, 2025 | Adversarial Attack, Safety Alignment |
| SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner | Jun 8, 2024 | Adversarial Attack, LLM Jailbreak |
| Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models | Oct 4, 2023 | GPU, Safety Alignment |
| Shape it Up! Restoring LLM Safety during Finetuning | May 22, 2025 | Safety Alignment |
| Smaller Large Language Models Can Do Moral Self-Correction | Oct 30, 2024 | Language Modeling |
| SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge | May 27, 2025 | Benchmarking, Multiple-choice |
| SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models | Jan 1, 2025 | Safety Alignment |