| Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues | Oct 14, 2024 | LLM Jailbreak, Safety Alignment | Code Available | 2 |
| Safety Alignment Should Be Made More Than Just a Few Tokens Deep | Jun 10, 2024 | Safety Alignment | Code Available | 2 |
| Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning | Feb 21, 2024 | Instruction Following, Language Modeling | Code Available | 2 |
| STAIR: Improving Safety Alignment with Introspective Reasoning | Feb 4, 2025 | Safety Alignment | Code Available | 2 |
| GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Aug 12, 2023 | Ethics, Red Teaming | Code Available | 2 |
| ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | Feb 19, 2024 | Safety Alignment | Code Available | 2 |
| DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers | Feb 25, 2024 | In-Context Learning, Safety Alignment | Code Available | 2 |
| CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion | Mar 12, 2024 | Code Completion, Safety Alignment | Code Available | 2 |
| PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks | May 20, 2025 | LLM Jailbreak, Safety Alignment | Code Available | 2 |
| Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation | Jan 29, 2025 | Red Teaming, Safety Alignment | Code Available | 2 |