| SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner | Jun 8, 2024 | Adversarial Attack, LLM Jailbreak | Unverified | 0 |
| On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept | Jun 4, 2024 | Question Answering, Safety Alignment | Unverified | 0 |
| OR-Bench: An Over-Refusal Benchmark for Large Language Models | May 31, 2024 | Safety Alignment | Code Available | 1 |
| Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens | May 31, 2024 | Safety Alignment | Unverified | 0 |
| One-Shot Safety Alignment for Large Language Models via Optimal Dualization | May 29, 2024 | Safety Alignment | Code Available | 0 |
| Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack | May 28, 2024 | Safety Alignment | Code Available | 1 |
| Cross-Modal Safety Alignment: Is textual unlearning all you need? | May 27, 2024 | All, Safety Alignment | Unverified | 0 |
| Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models | May 27, 2024 | Safety Alignment | Code Available | 1 |
| No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks | May 25, 2024 | Safety Alignment | Unverified | 0 |
| Robustifying Safety-Aligned Large Language Models through Clean Data Curation | May 24, 2024 | Safety Alignment | Unverified | 0 |