| Title | Date | Tasks | Code | Stars |
| --- | --- | --- | --- | --- |
| DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection | May 22, 2025 | Quantization, Safety Alignment | Code Available | 0 |
| Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | May 21, 2025 | Benchmarking, Language Modeling | Code Available | 0 |
| Safety Subspaces are Not Distinct: A Fine-Tuning Case Study | May 20, 2025 | Safety Alignment | Code Available | 1 |
| "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | May 20, 2025 | Image Generation, Red Teaming | Unverified | 0 |
| SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment | May 20, 2025 | Safety Alignment | Unverified | 0 |
| sudoLLM : On Multi-role Alignment of Language Models | May 20, 2025 | Language Modeling | Unverified | 0 |
| PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks | May 20, 2025 | LLM Jailbreak, Safety Alignment | Code Available | 2 |
| Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models | May 20, 2025 | Safety Alignment | Code Available | 1 |
| Safety Alignment Can Be Not Superficial With Explicit Safety Signals | May 19, 2025 | Binary Classification, Data Augmentation | Unverified | 0 |
| JULI: Jailbreak Large Language Models by Self-Introspection | May 17, 2025 | Safety Alignment | Unverified | 0 |
| SafeVid: Toward Safety Aligned Video Large Multimodal Models | May 17, 2025 | Safety Alignment | Unverified | 0 |
| Noise Injection Systemically Degrades Large Language Model Safety Guardrails | May 16, 2025 | Language Modeling | Unverified | 0 |
| Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction | May 16, 2025 | Contrastive Learning, Safety Alignment | Code Available | 2 |
| CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs | May 16, 2025 | Adversarial Robustness, Safety Alignment | Unverified | 0 |
| Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data | May 15, 2025 | Malware Detection, Safety Alignment | Unverified | 0 |
| One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models | May 12, 2025 | Code Generation, Safety Alignment | Unverified | 0 |
| FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning | May 12, 2025 | 16k, Benchmarking | Unverified | 0 |
| Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety | May 11, 2025 | Outlier Detection, Red Teaming | Code Available | 0 |
| Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model | May 10, 2025 | Safety Alignment | Code Available | 0 |
| NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models | Apr 29, 2025 | Safety Alignment | Unverified | 0 |
| SAGE: A Generic Framework for LLM Safety Evaluation | Apr 28, 2025 | Red Teaming, Safety Alignment | Code Available | 0 |
| What's Pulling the Strings? Evaluating Integrity and Attribution in AI Training and Inference through Concept Shift | Apr 28, 2025 | Attribute, Data Poisoning | Unverified | 0 |
| DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models | Apr 25, 2025 | Disentanglement, Safety Alignment | Code Available | 0 |
| AI Awareness | Apr 25, 2025 | Safety Alignment | Unverified | 0 |
| aiXamine: Simplified LLM Safety and Security | Apr 21, 2025 | 2k, Adversarial Robustness | Unverified | 0 |