| LLM Safeguard is a Double-Edged Sword: Exploiting False Positives for Denial-of-Service Attacks | Oct 3, 2024 | Adversarial Robustness, Safety Alignment | Unverified | 0 |
| SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models | Apr 9, 2025 | Safety Alignment | Unverified | 0 |
| SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment | May 20, 2025 | Safety Alignment | Unverified | 0 |
| Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models | Mar 22, 2025 | Misinformation, Safe Reinforcement Learning | Unverified | 0 |
| Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models | Nov 30, 2024 | Safety Alignment | Unverified | 0 |
| Safety Alignment Can Be Not Superficial With Explicit Safety Signals | May 19, 2025 | Binary Classification, Data Augmentation | Unverified | 0 |
| Safety Alignment for Vision Language Models | May 22, 2024 | Red Teaming, Safety Alignment | Unverified | 0 |
| Safety Alignment via Constrained Knowledge Unlearning | May 24, 2025 | Knowledge Editing, Safety Alignment | Unverified | 0 |
| SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation | Dec 13, 2024 | Image Generation, Safety Alignment | Unverified | 0 |
| PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling | Feb 4, 2025 | Safety Alignment | Code Available | 0 |