| Title | Date | Tasks | Code | # |
| --- | --- | --- | --- | --- |
| AI Alignment at Your Discretion | Feb 10, 2025 | Safety Alignment | Unverified | 0 |
| Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions | Feb 8, 2025 | Safety Alignment | Unverified | 0 |
| Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions | Feb 6, 2025 | Safety Alignment | Code Available | 1 |
| Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing | Feb 4, 2025 | Safety Alignment | Unverified | 0 |
| STAIR: Improving Safety Alignment with Introspective Reasoning | Feb 4, 2025 | Safety Alignment | Code Available | 2 |
| PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling | Feb 4, 2025 | Safety Alignment | Code Available | 0 |
| Internal Activation as the Polar Star for Steering Unsafe LLM Behavior | Feb 3, 2025 | Safety Alignment | Unverified | 0 |
| The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models | Feb 3, 2025 | Safety Alignment | Unverified | 0 |
| LLM Safety Alignment is Divergence Estimation in Disguise | Feb 2, 2025 | Language Modeling | Code Available | 0 |
| Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning | Jan 31, 2025 | Blocking, Safety Alignment | Unverified | 0 |
| Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation | Jan 30, 2025 | Safety Alignment | Code Available | 1 |
| Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation | Jan 29, 2025 | Red Teaming, Safety Alignment | Code Available | 2 |
| xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking | Jan 28, 2025 | Reinforcement Learning (RL), Safety Alignment | Code Available | 1 |
| Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare | Jan 27, 2025 | Language Modeling | Unverified | 0 |
| Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models | Jan 23, 2025 | Safety Alignment | Unverified | 0 |
| Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks | Jan 18, 2025 | Safety Alignment | Code Available | 0 |
| PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models | Jan 7, 2025 | Image Generation, Safety Alignment | Unverified | 0 |
| SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation | Jan 3, 2025 | Parameter-Efficient Fine-Tuning, Safety Alignment | Unverified | 0 |
| SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models | Jan 1, 2025 | Safety Alignment | Unverified | 0 |
| SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage | Dec 19, 2024 | Language Modeling | Code Available | 0 |
| Autonomous Microscopy Experiments through Large Language Model Agents | Dec 18, 2024 | Benchmarking, Experimental Design | Code Available | 1 |
| Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models | Dec 15, 2024 | Safety Alignment | Code Available | 0 |
| No Free Lunch for Defending Against Prefilling Attack by In-Context Learning | Dec 13, 2024 | In-Context Learning, Safety Alignment | Unverified | 0 |
| SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation | Dec 13, 2024 | Image Generation, Safety Alignment | Unverified | 0 |
| Model-Editing-Based Jailbreak against Safety-aligned Large Language Models | Dec 11, 2024 | Model Editing, Safety Alignment | Unverified | 0 |