| Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game | Apr 3, 2024 | Prompt Engineering, Safety Alignment | —Unverified | 0 |
| Enhancing Jailbreak Attacks with Diversity Guidance | Mar 1, 2024 | Diversity, Language Modelling | —Unverified | 0 |
| LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper | Feb 24, 2024 | Adversarial Attack, Safety Alignment | —Unverified | 0 |
| Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement | Feb 23, 2024 | Safety Alignment | —Unverified | 0 |
| Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications | Feb 7, 2024 | Safety Alignment | —Unverified | 0 |
| Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack | Dec 12, 2023 | Question Answering, Safety Alignment | Code Available | 0 |
| Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking | Nov 16, 2023 | Safety Alignment | —Unverified | 0 |
| RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models | Nov 16, 2023 | Backdoor Attack, Data Poisoning | —Unverified | 0 |
| How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities | Nov 15, 2023 | Ethics, Fairness | Code Available | 0 |
| MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | Nov 13, 2023 | Instruction Following, Red Teaming | —Unverified | 0 |