| Title | Date | Topic | Code | Citations |
|---|---|---|---|---|
| StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models | Feb 17, 2025 | Safety Alignment | Code Available | 0 |
| Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment | Feb 16, 2025 | Safety Alignment | Code Available | 0 |
| VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap | Feb 14, 2025 | Attribute, Safety Alignment | Unverified | 0 |
| Trustworthy AI: Safety, Bias, and Privacy -- A Survey | Feb 11, 2025 | Safety Alignment, Survey | Unverified | 0 |
| AI Alignment at Your Discretion | Feb 10, 2025 | Safety Alignment | Unverified | 0 |
| Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions | Feb 8, 2025 | Safety Alignment | Unverified | 0 |
| Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing | Feb 4, 2025 | Safety Alignment | Unverified | 0 |
| PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling | Feb 4, 2025 | Safety Alignment | Code Available | 0 |
| Internal Activation as the Polar Star for Steering Unsafe LLM Behavior | Feb 3, 2025 | Safety Alignment | Unverified | 0 |
| The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models | Feb 3, 2025 | Safety Alignment | Unverified | 0 |