| Title | Date | Tasks | Code | Stars |
|---|---|---|---|---|
| The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm | Jun 26, 2024 | Cross-Lingual Transfer, Red Teaming | Unverified | 0 |
| Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization | Jun 24, 2024 | Safety Alignment | Unverified | 0 |
| Finding Safety Neurons in Large Language Models | Jun 20, 2024 | Misinformation, Red Teaming | Unverified | 0 |
| Model Merging and Safety Alignment: One Bad Model Spoils the Bunch | Jun 20, 2024 | Model Merging, Safety Alignment | Unverified | 0 |
| PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference | Jun 20, 2024 | Question Answering, Safety Alignment | Unverified | 0 |
| Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Jun 17, 2024 | Language Modelling | Code Available | 0 |
| Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models | Jun 15, 2024 | Federated Learning, Language Modelling | Unverified | 0 |
| Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models | Jun 12, 2024 | Instruction Following, Safety Alignment | Unverified | 0 |
| SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner | Jun 8, 2024 | Adversarial Attack, LLM Jailbreak | Unverified | 0 |
| On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept | Jun 4, 2024 | Question Answering, Safety Alignment | Unverified | 0 |
| Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens | May 31, 2024 | Safety Alignment | Unverified | 0 |
| One-Shot Safety Alignment for Large Language Models via Optimal Dualization | May 29, 2024 | Safety Alignment | Code Available | 0 |
| Cross-Modal Safety Alignment: Is textual unlearning all you need? | May 27, 2024 | Safety Alignment | Unverified | 0 |
| No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks | May 25, 2024 | Safety Alignment | Unverified | 0 |
| Robustifying Safety-Aligned Large Language Models through Clean Data Curation | May 24, 2024 | Safety Alignment | Unverified | 0 |
| Safety Alignment for Vision Language Models | May 22, 2024 | Red Teaming, Safety Alignment | Unverified | 0 |
| WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response | May 22, 2024 | LLM Jailbreak, Safety Alignment | Unverified | 0 |
| Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching | May 22, 2024 | Safety Alignment | Unverified | 0 |
| Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge | Apr 8, 2024 | General Knowledge, Safety Alignment | Code Available | 0 |
| CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues | Apr 4, 2024 | Chatbot, Instruction Following | Unverified | 0 |
| Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game | Apr 3, 2024 | Prompt Engineering, Safety Alignment | Unverified | 0 |
| Enhancing Jailbreak Attacks with Diversity Guidance | Mar 1, 2024 | Diversity, Language Modelling | Unverified | 0 |
| LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper | Feb 24, 2024 | Adversarial Attack, Safety Alignment | Unverified | 0 |
| Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement | Feb 23, 2024 | Safety Alignment | Unverified | 0 |
| Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications | Feb 7, 2024 | Safety Alignment | Unverified | 0 |
| Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack | Dec 12, 2023 | Question Answering, Safety Alignment | Code Available | 0 |
| Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking | Nov 16, 2023 | Safety Alignment | Unverified | 0 |
| RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models | Nov 16, 2023 | Backdoor Attack, Data Poisoning | Unverified | 0 |
| How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities | Nov 15, 2023 | Ethics, Fairness | Code Available | 0 |
| MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | Nov 13, 2023 | Instruction Following, Red Teaming | Unverified | 0 |
| LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B | Oct 31, 2023 | GPU, Red Teaming | Unverified | 0 |
| Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks | Oct 16, 2023 | Adversarial Attack, Federated Learning | Unverified | 0 |
| Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations | Oct 10, 2023 | In-Context Learning, Language Modelling | Unverified | 0 |
| Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models | Oct 4, 2023 | GPU, Safety Alignment | Unverified | 0 |
| Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models | Aug 30, 2023 | Decoder, Safety Alignment | Unverified | 0 |
| Deceptive Alignment Monitoring | Jul 20, 2023 | Safety Alignment | Unverified | 0 |
| Model Card and Evaluations for Claude Models | Jul 11, 2023 | Arithmetic Reasoning, Bug Fixing | Unverified | 0 |
| Off-Policy Risk Assessment in Markov Decision Processes | Sep 21, 2022 | Multi-Armed Bandits, Safety Alignment | Unverified | 0 |