| Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates | Feb 28, 2024 | GSM8KSafety Alignment | CodeCode Available | 1 |
| Locking Down the Finetuned LLMs Safety | Oct 14, 2024 | Safety Alignment | CodeCode Available | 1 |
| Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning | Aug 18, 2024 | PhilosophySafety Alignment | CodeCode Available | 1 |
| FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Nov 9, 2023 | Optical Character Recognition (OCR)Safety Alignment | CodeCode Available | 1 |
| Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | Jul 31, 2024 | Safety Alignment | CodeCode Available | 1 |
| Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models | Jun 9, 2025 | Multi-agent Reinforcement LearningSafety Alignment | CodeCode Available | 1 |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and SafetyDiversity | CodeCode Available | 1 |
| Can Editing LLMs Inject Harm? | Jul 29, 2024 | FairnessGeneral Knowledge | CodeCode Available | 1 |
| All Languages Matter: On the Multilingual Safety of Large Language Models | Oct 2, 2023 | AllSafety Alignment | CodeCode Available | 1 |
| Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Oct 5, 2023 | AllLanguage Modeling | CodeCode Available | 1 |