| DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection | May 22, 2025 | Quantization, Safety Alignment | Code Available | 0 |
| Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | May 21, 2025 | Benchmarking, Language Modeling | Code Available | 0 |
| Safety Subspaces are Not Distinct: A Fine-Tuning Case Study | May 20, 2025 | Safety Alignment | Code Available | 1 |
| SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment | May 20, 2025 | Safety Alignment | Unverified | 0 |
| "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | May 20, 2025 | Image Generation, Red Teaming | Unverified | 0 |
| PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks | May 20, 2025 | LLM Jailbreak, Safety Alignment | Code Available | 2 |
| Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models | May 20, 2025 | Safety Alignment | Code Available | 1 |
| sudoLLM : On Multi-role Alignment of Language Models | May 20, 2025 | Language Modeling | Unverified | 0 |
| Safety Alignment Can Be Not Superficial With Explicit Safety Signals | May 19, 2025 | Binary Classification, Data Augmentation | Unverified | 0 |
| JULI: Jailbreak Large Language Models by Self-Introspection | May 17, 2025 | Safety Alignment | Unverified | 0 |