| Title | Date | Topics | Code |
| --- | --- | --- | --- |
| Lifelong Safety Alignment for Language Models | May 26, 2025 | Safety Alignment | Code Available |
| Locking Down the Finetuned LLMs Safety | Oct 14, 2024 | Safety Alignment | Code Available |
| Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning | Aug 18, 2024 | Philosophy, Safety Alignment | Code Available |
| Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | Jul 31, 2024 | Safety Alignment | Code Available |
| ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates | Jun 17, 2024 | Instruction Following, Safety Alignment | Code Available |
| All Languages Matter: On the Multilingual Safety of Large Language Models | Oct 2, 2023 | All, Safety Alignment | Code Available |
| Course-Correction: Safety Alignment Using Synthetic Preferences | Jul 23, 2024 | Safety Alignment | Code Available |
| Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Oct 5, 2023 | All, Language Modeling | Code Available |
| Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Feb 19, 2024 | Language Modeling | Code Available |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and Safety, Diversity | Code Available |