| CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion | Mar 12, 2024 | Code Completion, Safety Alignment | Code Available | 2 |
| Enhancing Jailbreak Attacks with Diversity Guidance | Mar 1, 2024 | Diversity, Language Modelling | Unverified | 0 |
| Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates | Feb 28, 2024 | GSM8K, Safety Alignment | Code Available | 1 |
| DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers | Feb 25, 2024 | In-Context Learning, Safety Alignment | Code Available | 2 |
| LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper | Feb 24, 2024 | Adversarial Attack, Safety Alignment | Unverified | 0 |
| Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement | Feb 23, 2024 | Safety Alignment | Unverified | 0 |
| Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment | Feb 22, 2024 | Backdoor Attack, Language Modelling | Code Available | 1 |
| Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning | Feb 21, 2024 | Instruction Following, Language Modeling | Code Available | 2 |
| Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Feb 19, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | Feb 19, 2024 | Safety Alignment | Code Available | 2 |
| Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space | Feb 14, 2024 | Adversarial Robustness, Safety Alignment | Code Available | 1 |
| Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications | Feb 7, 2024 | Safety Alignment | Unverified | 0 |
| Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | Feb 3, 2024 | Instruction Following, Safety Alignment | Code Available | 2 |
| MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Jan 5, 2024 | Safety Alignment | Code Available | 1 |
| Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack | Dec 12, 2023 | Question Answering, Safety Alignment | Code Available | 0 |
| RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models | Nov 16, 2023 | Backdoor Attack, Data Poisoning | Unverified | 0 |
| Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking | Nov 16, 2023 | Safety Alignment | Unverified | 0 |
| Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Nov 15, 2023 | Red Teaming, Safety Alignment | Code Available | 1 |
| How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities | Nov 15, 2023 | Ethics, Fairness | Code Available | 0 |
| MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | Nov 13, 2023 | Instruction Following, Red Teaming | Unverified | 0 |
| FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Nov 9, 2023 | Optical Character Recognition (OCR), Safety Alignment | Code Available | 1 |
| LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B | Oct 31, 2023 | GPU, Red Teaming | Unverified | 0 |
| SuperHF: Supervised Iterative Learning from Human Feedback | Oct 25, 2023 | Language Modelling, Safety Alignment | Code Available | 1 |
| AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Oct 23, 2023 | Adversarial Attack, Blocking | Code Available | 1 |
| Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks | Oct 16, 2023 | Adversarial Attack, Federated Learning | Unverified | 0 |
| Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations | Oct 10, 2023 | In-Context Learning, Language Modelling | Unverified | 0 |
| Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Oct 5, 2023 | Red Teaming, Safety Alignment | Code Available | 2 |
| Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Oct 5, 2023 | All, Language Modeling | Code Available | 1 |
| Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models | Oct 4, 2023 | GPU, Safety Alignment | Unverified | 0 |
| Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench | Oct 2, 2023 | Benchmarking, Safety Alignment | Code Available | 1 |
| All Languages Matter: On the Multilingual Safety of Large Language Models | Oct 2, 2023 | All, Safety Alignment | Code Available | 1 |
| Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models | Aug 30, 2023 | Decoder, Safety Alignment | Unverified | 0 |
| Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment | Aug 18, 2023 | MMLU, Red Teaming | Code Available | 1 |
| GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Aug 12, 2023 | Ethics, Red Teaming | Code Available | 2 |
| Deceptive Alignment Monitoring | Jul 20, 2023 | Safety Alignment | Unverified | 0 |
| Model Card and Evaluations for Claude Models | Jul 11, 2023 | Arithmetic Reasoning, Bug Fixing | Unverified | 0 |
| BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | Jul 10, 2023 | Question Answering, Safety Alignment | Code Available | 1 |
| Off-Policy Risk Assessment in Markov Decision Processes | Sep 21, 2022 | Multi-Armed Bandits, Safety Alignment | Unverified | 0 |