| Title | Date | Tags | Code | Verified |
| --- | --- | --- | --- | --- |
| Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation | Oct 13, 2024 | Safety Alignment, TAR | Code Available | 1 |
| AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation | Oct 11, 2024 | Safety Alignment | Code Available | 1 |
| SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering | Aug 21, 2024 | Safety Alignment | Code Available | 1 |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and Safety, Diversity | Code Available | 1 |
| Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning | Aug 18, 2024 | Philosophy, Safety Alignment | Code Available | 1 |
| Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | Jul 31, 2024 | Safety Alignment | Code Available | 1 |
| Can Editing LLMs Inject Harm? | Jul 29, 2024 | Fairness, General Knowledge | Code Available | 1 |
| Course-Correction: Safety Alignment Using Synthetic Preferences | Jul 23, 2024 | Safety Alignment | Code Available | 1 |
| Q-Adapter: Customizing Pre-trained LLMs to New Preferences with Forgetting Mitigation | Jul 4, 2024 | Q-Learning, Reinforcement Learning | Code Available | 1 |
| From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks | Jul 3, 2024 | Safety Alignment | Code Available | 1 |
| SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset | Jun 20, 2024 | Safety Alignment, Text-to-Video Generation | Code Available | 1 |
| SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models | Jun 18, 2024 | Safety Alignment | Code Available | 1 |
| SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model | Jun 17, 2024 | Language Modeling | Code Available | 1 |
| ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates | Jun 17, 2024 | Instruction Following, Safety Alignment | Code Available | 1 |
| Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations | Jun 17, 2024 | AI and Safety, Question Answering | Code Available | 1 |
| OR-Bench: An Over-Refusal Benchmark for Large Language Models | May 31, 2024 | Safety Alignment | Code Available | 1 |
| Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack | May 28, 2024 | Safety Alignment | Code Available | 1 |
| Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models | May 27, 2024 | Safety Alignment | Code Available | 1 |
| PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition | May 13, 2024 | Safety Alignment | Code Available | 1 |
| Don't Say No: Jailbreaking LLM by Suppressing Refusal | Apr 25, 2024 | Natural Language Inference, Safety Alignment | Code Available | 1 |
| Uncovering Safety Risks of Large Language Models through Concept Activation Vector | Apr 18, 2024 | Safety Alignment | Code Available | 1 |
| Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates | Feb 28, 2024 | GSM8K, Safety Alignment | Code Available | 1 |
| Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment | Feb 22, 2024 | Backdoor Attack, Language Modeling | Code Available | 1 |
| Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Feb 19, 2024 | Language Modeling | Code Available | 1 |
| Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space | Feb 14, 2024 | Adversarial Robustness, Safety Alignment | Code Available | 1 |
| MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Jan 5, 2024 | Safety Alignment | Code Available | 1 |
| Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Nov 15, 2023 | Red Teaming, Safety Alignment | Code Available | 1 |
| FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Nov 9, 2023 | Optical Character Recognition (OCR), Safety Alignment | Code Available | 1 |
| SuperHF: Supervised Iterative Learning from Human Feedback | Oct 25, 2023 | Language Modeling, Safety Alignment | Code Available | 1 |
| AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Oct 23, 2023 | Adversarial Attack, Blocking | Code Available | 1 |
| Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Oct 5, 2023 | Language Modeling | Code Available | 1 |
| All Languages Matter: On the Multilingual Safety of Large Language Models | Oct 2, 2023 | Safety Alignment | Code Available | 1 |
| Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench | Oct 2, 2023 | Benchmarking, Safety Alignment | Code Available | 1 |
| Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment | Aug 18, 2023 | MMLU, Red Teaming | Code Available | 1 |
| BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | Jul 10, 2023 | Question Answering, Safety Alignment | Code Available | 1 |
| TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data | Jul 8, 2025 | Chatbot, Instruction Following | Unverified | 0 |
| Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message | Jul 7, 2025 | Image Generation, Safety Alignment | Unverified | 0 |
| Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning | Jul 6, 2025 | Safety Alignment | Unverified | 0 |
| Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks | Jun 23, 2025 | Mixture-of-Experts, Safety Alignment | Unverified | 0 |
| Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs | Jun 21, 2025 | Safety Alignment | Code Available | 0 |
| SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification | Jun 20, 2025 | Mixture-of-Experts, Response Generation | Unverified | 0 |
| Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning | Jun 17, 2025 | Language Modeling | Unverified | 0 |
| Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs | Jun 16, 2025 | Diversity, Model Editing | Code Available | 0 |
| SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression | Jun 15, 2025 | LLM Jailbreak, Safety Alignment | Unverified | 0 |
| Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors | Jun 12, 2025 | Question Answering, Safety Alignment | Code Available | 0 |
| From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring | Jun 11, 2025 | Safety Alignment | Unverified | 0 |
| AdversariaL attacK sAfety aLIgnment (ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement - Introducing Adversarial Vulnerability Quality Index (AVQI) | Jun 10, 2025 | Adversarial Attack, Safety Alignment | Unverified | 0 |
| Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation | Jun 9, 2025 | Safety Alignment | Unverified | 0 |
| From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | Jun 7, 2025 | ARC, MMLU | Unverified | 0 |
| Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets | Jun 5, 2025 | Safety Alignment | Unverified | 0 |