| LLM Safeguard is a Double-Edged Sword: Exploiting False Positives for Denial-of-Service Attacks | Oct 3, 2024 | Adversarial Robustness, Safety Alignment | —Unverified | 0 | 0 |
| SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models | Apr 9, 2025 | Safety Alignment | —Unverified | 0 | 0 |
| SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment | May 20, 2025 | Safety Alignment | —Unverified | 0 | 0 |
| Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models | Mar 22, 2025 | Misinformation, Safe Reinforcement Learning | —Unverified | 0 | 0 |
| Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models | Nov 30, 2024 | Safety Alignment | —Unverified | 0 | 0 |
| Safety Alignment Can Be Not Superficial With Explicit Safety Signals | May 19, 2025 | Binary Classification, Data Augmentation | —Unverified | 0 | 0 |
| Safety Alignment for Vision Language Models | May 22, 2024 | Red Teaming, Safety Alignment | —Unverified | 0 | 0 |
| Safety Alignment via Constrained Knowledge Unlearning | May 24, 2025 | Knowledge Editing, Safety Alignment | —Unverified | 0 | 0 |
| SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation | Dec 13, 2024 | Image Generation, Safety Alignment | —Unverified | 0 | 0 |
| Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety | Mar 6, 2025 | Decision Making, Safety Alignment | —Unverified | 0 | 0 |