| Title | Date | Topics | Code | # |
| --- | --- | --- | --- | --- |
| The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs | Jul 15, 2025 | Code Generation, Safety Alignment | Code Available | 2 |
| TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data | Jul 8, 2025 | Chatbot, Instruction Following | Unverified | 0 |
| Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message | Jul 7, 2025 | Image Generation, Safety Alignment | Unverified | 0 |
| Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning | Jul 6, 2025 | Safety Alignment | Unverified | 0 |
| Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks | Jun 23, 2025 | Mixture-of-Experts, Safety Alignment | Unverified | 0 |
| Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs | Jun 21, 2025 | Safety Alignment | Code Available | 0 |
| SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification | Jun 20, 2025 | Mixture-of-Experts, Response Generation | Unverified | 0 |
| Probing the Robustness of Large Language Models Safety to Latent Perturbations | Jun 19, 2025 | Diagnostic, Safety Alignment | Code Available | 1 |
| Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models | Jun 19, 2025 | Large Language Model, Safety Alignment | Code Available | 1 |
| Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning | Jun 17, 2025 | Language Modeling | Unverified | 0 |