| Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues | Oct 14, 2024 | LLM Jailbreak, Safety Alignment | Code Available | 2 |
| Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation | Oct 13, 2024 | Safety Alignment, TAR | Code Available | 1 |
| Can a large language model be a gaslighter? | Oct 11, 2024 | Language Modeling | Code Available | 0 |
| AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation | Oct 11, 2024 | Safety Alignment | Code Available | 1 |
| Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models | Oct 11, 2024 | Safety Alignment | Unverified | 0 |
| Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements | Oct 11, 2024 | Safety Alignment | Unverified | 0 |
| Superficial Safety Alignment Hypothesis | Oct 7, 2024 | Attribute, Binary Classification | Unverified | 0 |
| Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models | Oct 7, 2024 | Language Modeling | Code Available | 0 |
| Toxic Subword Pruning for Dialogue Response Generation on Large Language Models | Oct 5, 2024 | Language Modeling, Machine Translation | Unverified | 0 |
| LLM Safeguard is a Double-Edged Sword: Exploiting False Positives for Denial-of-Service Attacks | Oct 3, 2024 | Adversarial Robustness, Safety Alignment | Unverified | 0 |
| SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks | Oct 2, 2024 | Safety Alignment | Unverified | 0 |
| Towards Inference-time Category-wise Safety Steering for Large Language Models | Oct 2, 2024 | Safety Alignment | Unverified | 0 |
| Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Oct 1, 2024 | Safety Alignment | Code Available | 0 |
| Overriding Safety protections of Open-source Models | Sep 28, 2024 | Red Teaming, Safety Alignment | Code Available | 0 |
| Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey | Sep 26, 2024 | Safety Alignment | Code Available | 3 |
| Backtracking Improves Generation Safety | Sep 22, 2024 | Language Modeling | Unverified | 0 |
| PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach | Sep 21, 2024 | Multi-agent Reinforcement Learning, Safety Alignment | Unverified | 0 |
| Mitigating Unsafe Feedback with Learning Constraints | Sep 19, 2024 | Safety Alignment, Text Generation | Unverified | 0 |
| SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering | Aug 21, 2024 | Safety Alignment | Code Available | 1 |
| Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer | Aug 21, 2024 | Safety Alignment | Code Available | 0 |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and Safety, Diversity | Code Available | 1 |
| Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation | Aug 20, 2024 | Safety Alignment | Unverified | 0 |
| Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning | Aug 18, 2024 | Philosophy, Safety Alignment | Code Available | 1 |
| SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | Aug 14, 2024 | Red Teaming, Safety Alignment | Unverified | 0 |
| Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions | Aug 14, 2024 | Safety Alignment | Code Available | 0 |