| The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis | Feb 13, 2025 | Safety Alignment | Code Available | 3 |
| Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey | Sep 26, 2024 | Safety Alignment | Code Available | 3 |
| The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs | Jul 15, 2025 | Code Generation, Safety Alignment | Code Available | 2 |
| PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks | May 20, 2025 | LLM Jailbreak, Safety Alignment | Code Available | 2 |
| Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction | May 16, 2025 | Contrastive Learning, Safety Alignment | Code Available | 2 |
| LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation | Apr 10, 2025 | Code Generation, Continual Learning | Code Available | 2 |
| STAIR: Improving Safety Alignment with Introspective Reasoning | Feb 4, 2025 | Safety Alignment | Code Available | 2 |
| Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation | Jan 29, 2025 | Red Teaming, Safety Alignment | Code Available | 2 |
| Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues | Oct 14, 2024 | LLM Jailbreak, Safety Alignment | Code Available | 2 |
| Cross-Modality Safety Alignment | Jun 21, 2024 | Safety Alignment | Code Available | 2 |