| Title | Date | Topics | Code |
| --- | --- | --- | --- |
| Probing the Robustness of Large Language Models Safety to Latent Perturbations | Jun 19, 2025 | Diagnostic, Safety Alignment | Code Available |
| Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models | Jun 19, 2025 | Large Language Model, Safety Alignment | Code Available |
| DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | Jun 11, 2025 | Safety Alignment | Code Available |
| Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models | Jun 9, 2025 | Multi-agent Reinforcement Learning, Safety Alignment | Code Available |
| RSafe: Incentivizing Proactive Reasoning to Build Robust and Adaptive LLM Safeguards | Jun 9, 2025 | Safety Alignment | Code Available |
| Lifelong Safety Alignment for Language Models | May 26, 2025 | Safety Alignment | Code Available |
| MPO: Multilingual Safety Alignment via Reward Gap Optimization | May 22, 2025 | Safety Alignment | Code Available |
| MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming | May 22, 2025 | Red Teaming, Safety Alignment | Code Available |
| Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models | May 20, 2025 | Safety Alignment | Code Available |
| Safety Subspaces are Not Distinct: A Fine-Tuning Case Study | May 20, 2025 | Safety Alignment | Code Available |