| X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents | Apr 15, 2025 | Diversity, Red Teaming | Unverified | 0 |
| LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models | Apr 14, 2025 | Persuasion Strategies, Safety Alignment | Unverified | 0 |
| RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability | Apr 14, 2025 | Safety Alignment | Unverified | 0 |
| Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? | Apr 14, 2025 | Safety Alignment | Unverified | 0 |
| SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models | Apr 9, 2025 | Safety Alignment | Unverified | 0 |
| ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization | Apr 3, 2025 | Safety Alignment | Unverified | 0 |
| More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment | Apr 3, 2025 | ARC, HellaSwag | Unverified | 0 |
| STAR-1: Safer Alignment of Reasoning LLMs with 1K Data | Apr 2, 2025 | Diversity, Safety Alignment | Unverified | 0 |
| Effectively Controlling Reasoning Models through Thinking Intervention | Mar 31, 2025 | Instruction Following, Safety Alignment | Unverified | 0 |
| Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models | Mar 22, 2025 | Misinformation, Safe Reinforcement Learning | Unverified | 0 |
| Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification | Mar 14, 2025 | Safety Alignment | Unverified | 0 |
| Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model | Mar 13, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing | Mar 12, 2025 | Red Teaming, Safety Alignment | Unverified | 0 |
| Backtracking for Safety | Mar 11, 2025 | Safety Alignment | Unverified | 0 |
| Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs | Mar 10, 2025 | Binary Classification, Safety Alignment | Unverified | 0 |
| SafeArena: Evaluating the Safety of Autonomous Web Agents | Mar 6, 2025 | Misinformation, Safety Alignment | Unverified | 0 |
| Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety | Mar 6, 2025 | Decision Making, Safety Alignment | Unverified | 0 |
| SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning | Mar 5, 2025 | Safe Reinforcement Learning, Safety Alignment | Unverified | 0 |
| LLM-Safety Evaluations Lack Robustness | Mar 4, 2025 | Red Teaming, Response Generation | Unverified | 0 |
| Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh | Mar 3, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts | Feb 28, 2025 | Safety Alignment | Unverified | 0 |
| The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence | Feb 24, 2025 | Safety Alignment | Unverified | 0 |
| Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment | Feb 21, 2025 | Safety Alignment | Unverified | 0 |
| C3AI: Crafting and Evaluating Constitutions for Constitutional AI | Feb 21, 2025 | Safety Alignment | Unverified | 0 |
| Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking | Feb 19, 2025 | Prompt Engineering, Safety Alignment | Code Available | 0 |
| Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region | Feb 19, 2025 | Decision Making, Safety Alignment | Unverified | 0 |
| SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings | Feb 18, 2025 | GPU, Safety Alignment | Code Available | 0 |
| Understanding and Rectifying Safety Perception Distortion in VLMs | Feb 18, 2025 | Disentanglement, Safety Alignment | Unverified | 0 |
| DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing | Feb 17, 2025 | Decision Making, Language Modeling | Unverified | 0 |
| Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models | Feb 17, 2025 | Safety Alignment | Unverified | 0 |
| StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models | Feb 17, 2025 | Safety Alignment | Code Available | 0 |
| Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment | Feb 16, 2025 | Safety Alignment | Code Available | 0 |
| VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap | Feb 14, 2025 | Attribute, Safety Alignment | Unverified | 0 |
| Trustworthy AI: Safety, Bias, and Privacy -- A Survey | Feb 11, 2025 | Safety Alignment, Survey | Unverified | 0 |
| AI Alignment at Your Discretion | Feb 10, 2025 | Safety Alignment | Unverified | 0 |
| Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions | Feb 8, 2025 | Safety Alignment | Unverified | 0 |
| Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing | Feb 4, 2025 | Safety Alignment | Unverified | 0 |
| PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling | Feb 4, 2025 | Safety Alignment | Code Available | 0 |
| Internal Activation as the Polar Star for Steering Unsafe LLM Behavior | Feb 3, 2025 | Safety Alignment | Unverified | 0 |
| The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models | Feb 3, 2025 | Safety Alignment | Unverified | 0 |
| LLM Safety Alignment is Divergence Estimation in Disguise | Feb 2, 2025 | Language Modeling, Language Modelling | Code Available | 0 |
| Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning | Jan 31, 2025 | Blocking, Safety Alignment | Unverified | 0 |
| Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare | Jan 27, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models | Jan 23, 2025 | Safety Alignment | Unverified | 0 |
| Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks | Jan 18, 2025 | Safety Alignment | Code Available | 0 |
| PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models | Jan 7, 2025 | Image Generation, Safety Alignment | Unverified | 0 |
| SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation | Jan 3, 2025 | parameter-efficient fine-tuning, Safety Alignment | Unverified | 0 |
| SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models | Jan 1, 2025 | Safety Alignment | Unverified | 0 |
| SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage | Dec 19, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models | Dec 15, 2024 | Safety Alignment | Code Available | 0 |