| EnJa: Ensemble Jailbreak on Large Language Models | Aug 7, 2024 | Safety Alignment | Unverified | 0 |
| Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | Jul 31, 2024 | Safety Alignment | Code Available | 1 |
| Can Editing LLMs Inject Harm? | Jul 29, 2024 | Fairness, General Knowledge | Code Available | 1 |
| Can Large Language Models Automatically Jailbreak GPT-4V? | Jul 23, 2024 | Face Recognition, In-Context Learning | Unverified | 0 |
| Course-Correction: Safety Alignment Using Synthetic Preferences | Jul 23, 2024 | Safety Alignment | Code Available | 1 |
| Failures to Find Transferable Image Jailbreaks Between Vision-Language Models | Jul 21, 2024 | Instruction Following, Language Modelling | Unverified | 0 |
| The Better Angels of Machine Personality: How Personality Relates to LLM Safety | Jul 17, 2024 | Fairness, Safety Alignment | Code Available | 0 |
| Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture | Jul 10, 2024 | Safety Alignment | Unverified | 0 |
| Jailbreak Attacks and Defenses Against Large Language Models: A Survey | Jul 5, 2024 | Code Completion, Question Answering | Unverified | 0 |
| Q-Adapter: Customizing Pre-trained LLMs to New Preferences with Forgetting Mitigation | Jul 4, 2024 | Q-Learning, Reinforcement Learning | Code Available | 1 |
| From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks | Jul 3, 2024 | Safety Alignment | Code Available | 1 |
| LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models | Jul 3, 2024 | Safety Alignment | Unverified | 0 |
| SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Jul 2, 2024 | Red Teaming, Safety Alignment | Code Available | 0 |
| SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance | Jun 26, 2024 | Safety Alignment | Code Available | 0 |
| The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm | Jun 26, 2024 | Cross-Lingual Transfer, Red Teaming | Unverified | 0 |
| Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization | Jun 24, 2024 | Safety Alignment | Unverified | 0 |
| Cross-Modality Safety Alignment | Jun 21, 2024 | Safety Alignment | Code Available | 2 |
| PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference | Jun 20, 2024 | Question Answering, Safety Alignment | Unverified | 0 |
| Model Merging and Safety Alignment: One Bad Model Spoils the Bunch | Jun 20, 2024 | Model, Safety Alignment | Unverified | 0 |
| SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset | Jun 20, 2024 | Safety Alignment, Text-to-Video Generation | Code Available | 1 |
| Finding Safety Neurons in Large Language Models | Jun 20, 2024 | Misinformation, Red Teaming | Unverified | 0 |
| SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models | Jun 18, 2024 | Safety Alignment | Code Available | 1 |
| Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Jun 17, 2024 | 16k, Language Modelling | Code Available | 0 |
| ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates | Jun 17, 2024 | Instruction Following, Safety Alignment | Code Available | 1 |
| Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations | Jun 17, 2024 | AI and Safety, Question Answering | Code Available | 1 |
| SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model | Jun 17, 2024 | Language Modelling | Code Available | 1 |
| Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models | Jun 15, 2024 | Federated Learning, Language Modelling | Unverified | 0 |
| Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models | Jun 12, 2024 | Instruction Following, Safety Alignment | Unverified | 0 |
| Safety Alignment Should Be Made More Than Just a Few Tokens Deep | Jun 10, 2024 | Safety Alignment | Code Available | 2 |
| How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | Jun 9, 2024 | Safety Alignment | Code Available | 2 |
| SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner | Jun 8, 2024 | Adversarial Attack, LLM Jailbreak | Unverified | 0 |
| On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept | Jun 4, 2024 | Question Answering, Safety Alignment | Unverified | 0 |
| OR-Bench: An Over-Refusal Benchmark for Large Language Models | May 31, 2024 | Safety Alignment | Code Available | 1 |
| Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens | May 31, 2024 | Safety Alignment | Unverified | 0 |
| One-Shot Safety Alignment for Large Language Models via Optimal Dualization | May 29, 2024 | Safety Alignment | Code Available | 0 |
| Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack | May 28, 2024 | Safety Alignment | Code Available | 1 |
| Cross-Modal Safety Alignment: Is textual unlearning all you need? | May 27, 2024 | Safety Alignment | Unverified | 0 |
| Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models | May 27, 2024 | Safety Alignment | Code Available | 1 |
| No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks | May 25, 2024 | Safety Alignment | Unverified | 0 |
| Robustifying Safety-Aligned Large Language Models through Clean Data Curation | May 24, 2024 | Safety Alignment | Unverified | 0 |
| Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching | May 22, 2024 | Safety Alignment | Unverified | 0 |
| WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response | May 22, 2024 | LLM Jailbreak, Safety Alignment | Unverified | 0 |
| Safety Alignment for Vision Language Models | May 22, 2024 | Red Teaming, Safety Alignment | Unverified | 0 |
| PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition | May 13, 2024 | Safety Alignment | Code Available | 1 |
| Don't Say No: Jailbreaking LLM by Suppressing Refusal | Apr 25, 2024 | Natural Language Inference, Safety Alignment | Code Available | 1 |
| Uncovering Safety Risks of Large Language Models through Concept Activation Vector | Apr 18, 2024 | Safety Alignment | Code Available | 1 |
| AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs | Apr 11, 2024 | Safety Alignment | Code Available | 2 |
| Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge | Apr 8, 2024 | General Knowledge, Safety Alignment | Code Available | 0 |
| CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues | Apr 4, 2024 | Chatbot, Instruction Following | Unverified | 0 |
| Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game | Apr 3, 2024 | Prompt Engineering, Safety Alignment | Unverified | 0 |