| Safety Alignment Should Be Made More Than Just a Few Tokens Deep | Jun 10, 2024 | Safety Alignment | Code Available | 2 |
| How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | Jun 9, 2024 | Safety Alignment | Code Available | 2 |
| AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs | Apr 11, 2024 | Safety Alignment | Code Available | 2 |
| CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion | Mar 12, 2024 | Code Completion, Safety Alignment | Code Available | 2 |
| DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers | Feb 25, 2024 | In-Context Learning, Safety Alignment | Code Available | 2 |
| Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning | Feb 21, 2024 | Instruction Following, Language Modeling | Code Available | 2 |
| ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | Feb 19, 2024 | Safety Alignment | Code Available | 2 |
| Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | Feb 3, 2024 | Instruction Following, Safety Alignment | Code Available | 2 |
| Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Oct 5, 2023 | Red Teaming, Safety Alignment | Code Available | 2 |
| GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Aug 12, 2023 | Ethics, Red Teaming | Code Available | 2 |