| The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm | Jun 26, 2024 | Cross-Lingual TransferRed Teaming | —Unverified | 0 |
| Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization | Jun 24, 2024 | Safety Alignment | —Unverified | 0 |
| Finding Safety Neurons in Large Language Models | Jun 20, 2024 | MisinformationRed Teaming | —Unverified | 0 |
| Model Merging and Safety Alignment: One Bad Model Spoils the Bunch | Jun 20, 2024 | modelSafety Alignment | —Unverified | 0 |
| PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference | Jun 20, 2024 | Question AnsweringSafety Alignment | —Unverified | 0 |
| Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Jun 17, 2024 | 16kLanguage Modelling | CodeCode Available | 0 |
| Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models | Jun 15, 2024 | Federated LearningLanguage Modelling | —Unverified | 0 |
| Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models | Jun 12, 2024 | Instruction FollowingSafety Alignment | —Unverified | 0 |
| SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner | Jun 8, 2024 | Adversarial AttackLLM Jailbreak | —Unverified | 0 |
| On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept | Jun 4, 2024 | Question AnsweringSafety Alignment | —Unverified | 0 |
| Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens | May 31, 2024 | Safety Alignment | —Unverified | 0 |
| One-Shot Safety Alignment for Large Language Models via Optimal Dualization | May 29, 2024 | Safety Alignment | CodeCode Available | 0 |
| Cross-Modal Safety Alignment: Is textual unlearning all you need? | May 27, 2024 | AllSafety Alignment | —Unverified | 0 |
| No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks | May 25, 2024 | Safety Alignment | —Unverified | 0 |
| Robustifying Safety-Aligned Large Language Models through Clean Data Curation | May 24, 2024 | Safety Alignment | —Unverified | 0 |
| Safety Alignment for Vision Language Models | May 22, 2024 | Red TeamingSafety Alignment | —Unverified | 0 |
| WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response | May 22, 2024 | LLM JailbreakSafety Alignment | —Unverified | 0 |
| Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching | May 22, 2024 | Safety Alignment | —Unverified | 0 |
| Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge | Apr 8, 2024 | General KnowledgeSafety Alignment | CodeCode Available | 0 |
| CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues | Apr 4, 2024 | ChatbotInstruction Following | —Unverified | 0 |
| Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game | Apr 3, 2024 | Prompt EngineeringSafety Alignment | —Unverified | 0 |
| Enhancing Jailbreak Attacks with Diversity Guidance | Mar 1, 2024 | DiversityLanguage Modelling | —Unverified | 0 |
| LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper | Feb 24, 2024 | Adversarial AttackSafety Alignment | —Unverified | 0 |
| Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement | Feb 23, 2024 | Safety Alignment | —Unverified | 0 |
| Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications | Feb 7, 2024 | Safety Alignment | —Unverified | 0 |