| Safety Alignment via Constrained Knowledge Unlearning | May 24, 2025 | knowledge editingSafety Alignment | —Unverified | 0 |
| Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary | May 23, 2025 | Safety Alignment | —Unverified | 0 |
| Alignment and Safety of Diffusion Models via Reinforcement Learning and Reward Modeling: A Survey | May 23, 2025 | Active LearningReinforcement Learning (RL) | —Unverified | 0 |
| One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | May 23, 2025 | AllSafety Alignment | CodeCode Available | 0 |
| Shape it Up! Restoring LLM Safety during Finetuning | May 22, 2025 | Safety Alignment | —Unverified | 0 |
| MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming | May 22, 2025 | Red TeamingSafety Alignment | CodeCode Available | 1 |
| From Evaluation to Defense: Advancing Safety in Video Large Language Models | May 22, 2025 | Safety Alignment | —Unverified | 0 |
| MPO: Multilingual Safety Alignment via Reward Gap Optimization | May 22, 2025 | Safety Alignment | CodeCode Available | 1 |
| CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning | May 22, 2025 | Language ModelingLanguage Modelling | —Unverified | 0 |
| DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection | May 22, 2025 | QuantizationSafety Alignment | CodeCode Available | 0 |