| Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey | Sep 26, 2024 | Safety Alignment | Code Available | 3 |
| The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis | Feb 13, 2025 | Safety Alignment | Code Available | 3 |
| GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Aug 12, 2023 | Ethics, Red Teaming | Code Available | 2 |
| DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers | Feb 25, 2024 | In-Context Learning, Safety Alignment | Code Available | 2 |
| Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction | May 16, 2025 | Contrastive Learning, Safety Alignment | Code Available | 2 |
| Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Oct 5, 2023 | Red Teaming, Safety Alignment | Code Available | 2 |
| ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | Feb 19, 2024 | Safety Alignment | Code Available | 2 |
| AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs | Apr 11, 2024 | Safety Alignment | Code Available | 2 |
| Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues | Oct 14, 2024 | LLM Jailbreak, Safety Alignment | Code Available | 2 |
| Cross-Modality Safety Alignment | Jun 21, 2024 | Safety Alignment | Code Available | 2 |