| Title | Date | Tags | Code |
| --- | --- | --- | --- |
| Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation | Oct 13, 2024 | Safety Alignment, TAR | Code Available |
| AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation | Oct 11, 2024 | Safety Alignment | Code Available |
| SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering | Aug 21, 2024 | Safety Alignment | Code Available |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and Safety, Diversity | Code Available |
| Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning | Aug 18, 2024 | Philosophy, Safety Alignment | Code Available |
| Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | Jul 31, 2024 | Safety Alignment | Code Available |
| Can Editing LLMs Inject Harm? | Jul 29, 2024 | Fairness, General Knowledge | Code Available |
| Course-Correction: Safety Alignment Using Synthetic Preferences | Jul 23, 2024 | Safety Alignment | Code Available |
| Q-Adapter: Customizing Pre-trained LLMs to New Preferences with Forgetting Mitigation | Jul 4, 2024 | Q-Learning, reinforcement-learning | Code Available |
| From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks | Jul 3, 2024 | Safety Alignment | Code Available |