| Title | Date | Tasks | Code | Stars |
| --- | --- | --- | --- | --- |
| SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification | Jun 20, 2025 | Mixture-of-Experts, Response Generation | Unverified | 0 |
| Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning | Jun 17, 2025 | Language Modeling | Unverified | 0 |
| Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs | Jun 16, 2025 | Diversity, Model Editing | Code Available | 0 |
| SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression | Jun 15, 2025 | LLM Jailbreak, Safety Alignment | Unverified | 0 |
| Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors | Jun 12, 2025 | Question Answering, Safety Alignment | Code Available | 0 |
| From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring | Jun 11, 2025 | Safety Alignment | Unverified | 0 |
| AdversariaL attacK sAfety aLIgnment (ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement - Introducing Adversarial Vulnerability Quality Index (AVQI) | Jun 10, 2025 | Adversarial Attack, Safety Alignment | Unverified | 0 |
| Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation | Jun 9, 2025 | Safety Alignment | Unverified | 0 |
| From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | Jun 7, 2025 | ARC, MMLU | Unverified | 0 |
| Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets | Jun 5, 2025 | Safety Alignment | Unverified | 0 |