| Title | Date | Topics | Code | # |
| --- | --- | --- | --- | --- |
| From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks | Jul 3, 2024 | Safety Alignment | Code Available | 1 |
| LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models | Jul 3, 2024 | Safety Alignment | Unverified | 0 |
| SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Jul 2, 2024 | Red Teaming, Safety Alignment | Code Available | 0 |
| SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance | Jun 26, 2024 | Safety Alignment | Code Available | 0 |
| The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm | Jun 26, 2024 | Cross-Lingual Transfer, Red Teaming | Unverified | 0 |
| Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization | Jun 24, 2024 | Safety Alignment | Unverified | 0 |
| Cross-Modality Safety Alignment | Jun 21, 2024 | Safety Alignment | Code Available | 2 |
| PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference | Jun 20, 2024 | Question Answering, Safety Alignment | Unverified | 0 |
| Model Merging and Safety Alignment: One Bad Model Spoils the Bunch | Jun 20, 2024 | Model, Safety Alignment | Unverified | 0 |
| SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset | Jun 20, 2024 | Safety Alignment, Text-to-Video Generation | Code Available | 1 |