| Title | Date | Tags | Code | # |
| --- | --- | --- | --- | --- |
| Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety | Mar 6, 2025 | Decision Making, Safety Alignment | Unverified | 0 |
| Improving LLM Safety Alignment with Dual-Objective Optimization | Mar 5, 2025 | Safety Alignment | Code Available | 1 |
| SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning | Mar 5, 2025 | Safe Reinforcement Learning, Safety Alignment | Unverified | 0 |
| LLM-Safety Evaluations Lack Robustness | Mar 4, 2025 | Red Teaming, Response Generation | Unverified | 0 |
| Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh | Mar 3, 2025 | Language Modeling | Unverified | 0 |
| Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable | Mar 1, 2025 | Language Modeling | Code Available | 1 |
| Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks | Feb 28, 2025 | Safety Alignment | Code Available | 1 |
| FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts | Feb 28, 2025 | Safety Alignment | Unverified | 0 |
| The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence | Feb 24, 2025 | Safety Alignment | Unverified | 0 |
| Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment | Feb 21, 2025 | Safety Alignment | Unverified | 0 |
| C3AI: Crafting and Evaluating Constitutions for Constitutional AI | Feb 21, 2025 | Safety Alignment | Unverified | 0 |
| Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking | Feb 19, 2025 | Prompt Engineering, Safety Alignment | Code Available | 0 |
| Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region | Feb 19, 2025 | Decision Making, Safety Alignment | Unverified | 0 |
| Understanding and Rectifying Safety Perception Distortion in VLMs | Feb 18, 2025 | Disentanglement, Safety Alignment | Unverified | 0 |
| SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings | Feb 18, 2025 | GPU, Safety Alignment | Code Available | 0 |
| Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models | Feb 17, 2025 | Safety Alignment | Unverified | 0 |
| StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models | Feb 17, 2025 | Safety Alignment | Code Available | 0 |
| DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing | Feb 17, 2025 | Decision Making, Language Modeling | Unverified | 0 |
| Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment | Feb 16, 2025 | Safety Alignment | Code Available | 0 |
| Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models | Feb 16, 2025 | Safety Alignment | Code Available | 1 |
| X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability | Feb 14, 2025 | Safety Alignment | Code Available | 1 |
| VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap | Feb 14, 2025 | Attribute, Safety Alignment | Unverified | 0 |
| The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis | Feb 13, 2025 | Safety Alignment | Code Available | 3 |
| QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language | Feb 13, 2025 | Safety Alignment | Code Available | 1 |
| Trustworthy AI: Safety, Bias, and Privacy -- A Survey | Feb 11, 2025 | Safety Alignment, Survey | Unverified | 0 |