| Title | Date | Tasks | Code | # |
|---|---|---|---|---|
| DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection | May 22, 2025 | Quantization, Safety Alignment | Code Available | 0 |
| Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | May 21, 2025 | Benchmarking, Language Modeling | Code Available | 0 |
| "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | May 20, 2025 | Image Generation, Red Teaming | Unverified | 0 |
| SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment | May 20, 2025 | Safety Alignment | Unverified | 0 |
| Safety Subspaces are Not Distinct: A Fine-Tuning Case Study | May 20, 2025 | Safety Alignment | Code Available | 1 |
| Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models | May 20, 2025 | Safety Alignment | Code Available | 1 |
| sudoLLM: On Multi-role Alignment of Language Models | May 20, 2025 | Language Modeling | Unverified | 0 |
| PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks | May 20, 2025 | LLM Jailbreak, Safety Alignment | Code Available | 2 |
| Safety Alignment Can Be Not Superficial With Explicit Safety Signals | May 19, 2025 | Binary Classification, Data Augmentation | Unverified | 0 |
| JULI: Jailbreak Large Language Models by Self-Introspection | May 17, 2025 | Safety Alignment | Unverified | 0 |
| SafeVid: Toward Safety Aligned Video Large Multimodal Models | May 17, 2025 | Safety Alignment | Unverified | 0 |
| Noise Injection Systemically Degrades Large Language Model Safety Guardrails | May 16, 2025 | Language Modeling | Unverified | 0 |
| Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction | May 16, 2025 | Contrastive Learning, Safety Alignment | Code Available | 2 |
| CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs | May 16, 2025 | Adversarial Robustness, Safety Alignment | Unverified | 0 |
| Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data | May 15, 2025 | Malware Detection, Safety Alignment | Unverified | 0 |
| FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning | May 12, 2025 | Benchmarking | Unverified | 0 |
| One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models | May 12, 2025 | Code Generation, Safety Alignment | Unverified | 0 |
| Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety | May 11, 2025 | Outlier Detection, Red Teaming | Code Available | 0 |
| Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model | May 10, 2025 | Safety Alignment | Code Available | 0 |
| NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models | Apr 29, 2025 | Safety Alignment | Unverified | 0 |
| SAGE: A Generic Framework for LLM Safety Evaluation | Apr 28, 2025 | Red Teaming, Safety Alignment | Code Available | 0 |
| What's Pulling the Strings? Evaluating Integrity and Attribution in AI Training and Inference through Concept Shift | Apr 28, 2025 | Attribute, Data Poisoning | Unverified | 0 |
| DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models | Apr 25, 2025 | Disentanglement, Safety Alignment | Code Available | 0 |
| AI Awareness | Apr 25, 2025 | Safety Alignment | Unverified | 0 |
| aiXamine: Simplified LLM Safety and Security | Apr 21, 2025 | Adversarial Robustness | Unverified | 0 |
| Towards NSFW-Free Text-to-Image Generation via Safety-Constraint Direct Preference Optimization | Apr 19, 2025 | Contrastive Learning, Image Generation | Unverified | 0 |
| Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models | Apr 18, 2025 | Safety Alignment | Unverified | 0 |
| VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization | Apr 17, 2025 | Multimodal Reasoning, Safety Alignment | Unverified | 0 |
| X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents | Apr 15, 2025 | Diversity, Red Teaming | Unverified | 0 |
| RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability | Apr 14, 2025 | Safety Alignment | Unverified | 0 |
| Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? | Apr 14, 2025 | Safety Alignment | Unverified | 0 |
| LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models | Apr 14, 2025 | Persuasion Strategies, Safety Alignment | Unverified | 0 |
| AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender | Apr 13, 2025 | Safety Alignment | Code Available | 1 |
| LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation | Apr 10, 2025 | Code Generation, Continual Learning | Code Available | 2 |
| SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models | Apr 9, 2025 | Safety Alignment | Unverified | 0 |
| More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment | Apr 3, 2025 | ARC, HellaSwag | Unverified | 0 |
| ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization | Apr 3, 2025 | Safety Alignment | Unverified | 0 |
| STAR-1: Safer Alignment of Reasoning LLMs with 1K Data | Apr 2, 2025 | Diversity, Safety Alignment | Unverified | 0 |
| Effectively Controlling Reasoning Models through Thinking Intervention | Mar 31, 2025 | Instruction Following, Safety Alignment | Unverified | 0 |
| VPO: Aligning Text-to-Video Generation Models with Prompt Optimization | Mar 26, 2025 | In-Context Learning, Safety Alignment | Code Available | 1 |
| sudo rm -rf agentic_security | Mar 26, 2025 | Adversarial Attack, AI and Safety | Code Available | 1 |
| LookAhead Tuning: Safer Language Models via Partial Answer Previews | Mar 24, 2025 | Position, Safety Alignment | Code Available | 1 |
| Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models | Mar 22, 2025 | Misinformation, Safe Reinforcement Learning | Unverified | 0 |
| SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging | Mar 21, 2025 | GSM8K, Safety Alignment | Code Available | 1 |
| Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification | Mar 14, 2025 | Safety Alignment | Unverified | 0 |
| Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model | Mar 13, 2025 | Language Modeling | Unverified | 0 |
| JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing | Mar 12, 2025 | Red Teaming, Safety Alignment | Unverified | 0 |
| Backtracking for Safety | Mar 11, 2025 | Safety Alignment | Unverified | 0 |
| Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs | Mar 10, 2025 | Binary Classification, Safety Alignment | Unverified | 0 |
| SafeArena: Evaluating the Safety of Autonomous Web Agents | Mar 6, 2025 | Misinformation, Safety Alignment | Unverified | 0 |