| Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching | May 22, 2024 | Safety Alignment | Unverified | 0 |
| WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response | May 22, 2024 | LLM Jailbreak, Safety Alignment | Unverified | 0 |
| Safety Alignment for Vision Language Models | May 22, 2024 | Red Teaming, Safety Alignment | Unverified | 0 |
| PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition | May 13, 2024 | Safety Alignment | Code Available | 1 |
| Don't Say No: Jailbreaking LLM by Suppressing Refusal | Apr 25, 2024 | Natural Language Inference, Safety Alignment | Code Available | 1 |
| Uncovering Safety Risks of Large Language Models through Concept Activation Vector | Apr 18, 2024 | Safety Alignment | Code Available | 1 |
| AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs | Apr 11, 2024 | Safety Alignment | Code Available | 2 |
| Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge | Apr 8, 2024 | General Knowledge, Safety Alignment | Code Available | 0 |
| CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues | Apr 4, 2024 | Chatbot, Instruction Following | Unverified | 0 |
| Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game | Apr 3, 2024 | Prompt Engineering, Safety Alignment | Unverified | 0 |