| Title | Date | Tasks | Code | # |
| --- | --- | --- | --- | --- |
| The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs | Jul 15, 2025 | Code Generation, Safety Alignment | Available | 2 |
| TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data | Jul 8, 2025 | Chatbot, Instruction Following | Unverified | 0 |
| Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message | Jul 7, 2025 | Image Generation, Safety Alignment | Unverified | 0 |
| Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning | Jul 6, 2025 | Safety Alignment | Unverified | 0 |
| Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks | Jun 23, 2025 | Mixture-of-Experts, Safety Alignment | Unverified | 0 |
| Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs | Jun 21, 2025 | Safety Alignment | Available | 0 |
| SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification | Jun 20, 2025 | Mixture-of-Experts, Response Generation | Unverified | 0 |
| Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models | Jun 19, 2025 | Large Language Model, Safety Alignment | Available | 1 |
| Probing the Robustness of Large Language Models Safety to Latent Perturbations | Jun 19, 2025 | Diagnostic, Safety Alignment | Available | 1 |
| Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning | Jun 17, 2025 | Language Modeling | Unverified | 0 |
| Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs | Jun 16, 2025 | Diversity, Model Editing | Available | 0 |
| SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression | Jun 15, 2025 | LLM Jailbreak, Safety Alignment | Unverified | 0 |
| Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors | Jun 12, 2025 | Question Answering, Safety Alignment | Available | 0 |
| From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring | Jun 11, 2025 | Safety Alignment | Unverified | 0 |
| DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | Jun 11, 2025 | Safety Alignment | Available | 1 |
| AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement- Introducing Adversarial Vulnerability Quality Index (AVQI) | Jun 10, 2025 | Adversarial Attack, Safety Alignment | Unverified | 0 |
| Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation | Jun 9, 2025 | Safety Alignment | Unverified | 0 |
| RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards | Jun 9, 2025 | Safety Alignment | Available | 1 |
| Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models | Jun 9, 2025 | Multi-agent Reinforcement Learning, Safety Alignment | Available | 1 |
| From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | Jun 7, 2025 | ARC, MMLU | Unverified | 0 |
| Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets | Jun 5, 2025 | Safety Alignment | Unverified | 0 |
| Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning | Jun 4, 2025 | Safety Alignment | Unverified | 0 |
| DiaBlo: Diagonal Blocks Are Sufficient For Finetuning | Jun 3, 2025 | Arithmetic Reasoning, Code Generation | Available | 0 |
| BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | Jun 3, 2025 | Prompt Engineering, Red Teaming | Available | 0 |
| Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models | Jun 2, 2025 | Safety Alignment | Unverified | 0 |