| Title | Date | Tags | Code | Stars |
| --- | --- | --- | --- | --- |
| SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation | Dec 13, 2024 | Image Generation, Safety Alignment | Unverified | 0 |
| No Free Lunch for Defending Against Prefilling Attack by In-Context Learning | Dec 13, 2024 | In-Context Learning, Safety Alignment | Unverified | 0 |
| Model-Editing-Based Jailbreak against Safety-aligned Large Language Models | Dec 11, 2024 | Model Editing, Safety Alignment | Unverified | 0 |
| Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars | Dec 10, 2024 | Safety Alignment | Unverified | 0 |
| SafeWorld: Geo-Diverse Safety Alignment | Dec 9, 2024 | Safety Alignment | Code Available | 0 |
| Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models | Nov 30, 2024 | Safety Alignment | Unverified | 0 |
| PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning | Nov 28, 2024 | Federated Learning, Parameter-Efficient Fine-Tuning | Unverified | 0 |
| Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models | Nov 27, 2024 | Image Generation, Safety Alignment | Unverified | 0 |
| Don't Command, Cultivate: An Exploratory Study of System-2 Alignment | Nov 26, 2024 | Prompt Engineering, Safety Alignment | Code Available | 0 |
| Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine | Nov 20, 2024 | Fairness, Safety Alignment | Unverified | 0 |
| PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment | Nov 18, 2024 | Language Modelling | Unverified | 0 |
| Playing Language Game with LLMs Leads to Jailbreaking | Nov 16, 2024 | Safety Alignment | Unverified | 0 |
| Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models | Nov 6, 2024 | Safety Alignment | Unverified | 0 |
| Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment | Nov 5, 2024 | Quantization, Safety Alignment | Code Available | 0 |
| Code-Switching Curriculum Learning for Multilingual Transfer in LLMs | Nov 4, 2024 | Cross-Lingual Transfer, Language Acquisition | Unverified | 0 |
| Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Oct 31, 2024 | Red Teaming, Safety Alignment | Code Available | 0 |
| Smaller Large Language Models Can Do Moral Self-Correction | Oct 30, 2024 | Language Modelling | Unverified | 0 |
| Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization | Oct 25, 2024 | Safety Alignment | Code Available | 0 |
| Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities | Oct 24, 2024 | Safety Alignment | Code Available | 0 |
| Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks | Oct 23, 2024 | Instruction Following, Safety Alignment | Unverified | 0 |
| A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement | Oct 17, 2024 | Language Modelling | Code Available | 0 |
| SPIN: Self-Supervised Prompt INjection | Oct 17, 2024 | Safety Alignment | Unverified | 0 |
| BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models | Oct 17, 2024 | Red Teaming, Safety Alignment | Code Available | 0 |
| Can a large language model be a gaslighter? | Oct 11, 2024 | Language Modelling | Code Available | 0 |
| Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models | Oct 11, 2024 | Safety Alignment | Unverified | 0 |
| Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements | Oct 11, 2024 | Safety Alignment | Unverified | 0 |
| Superficial Safety Alignment Hypothesis | Oct 7, 2024 | Attribute, Binary Classification | Unverified | 0 |
| Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models | Oct 7, 2024 | Language Modelling | Code Available | 0 |
| Toxic Subword Pruning for Dialogue Response Generation on Large Language Models | Oct 5, 2024 | Language Modelling, Machine Translation | Unverified | 0 |
| LLM Safeguard is a Double-Edged Sword: Exploiting False Positives for Denial-of-Service Attacks | Oct 3, 2024 | Adversarial Robustness, Safety Alignment | Unverified | 0 |
| SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks | Oct 2, 2024 | Safety Alignment | Unverified | 0 |
| Towards Inference-time Category-wise Safety Steering for Large Language Models | Oct 2, 2024 | Safety Alignment | Unverified | 0 |
| Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Oct 1, 2024 | Safety Alignment | Code Available | 0 |
| Overriding Safety Protections of Open-source Models | Sep 28, 2024 | Red Teaming, Safety Alignment | Code Available | 0 |
| Backtracking Improves Generation Safety | Sep 22, 2024 | Language Modelling | Unverified | 0 |
| PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach | Sep 21, 2024 | Multi-agent Reinforcement Learning, Safety Alignment | Unverified | 0 |
| Mitigating Unsafe Feedback with Learning Constraints | Sep 19, 2024 | Safety Alignment, Text Generation | Unverified | 0 |
| Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer | Aug 21, 2024 | Safety Alignment | Code Available | 0 |
| Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation | Aug 20, 2024 | Safety Alignment | Unverified | 0 |
| Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions | Aug 14, 2024 | Safety Alignment | Code Available | 0 |
| SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | Aug 14, 2024 | Red Teaming, Safety Alignment | Unverified | 0 |
| EnJa: Ensemble Jailbreak on Large Language Models | Aug 7, 2024 | Safety Alignment | Unverified | 0 |
| Can Large Language Models Automatically Jailbreak GPT-4V? | Jul 23, 2024 | Face Recognition, In-Context Learning | Unverified | 0 |
| Failures to Find Transferable Image Jailbreaks Between Vision-Language Models | Jul 21, 2024 | Instruction Following, Language Modelling | Unverified | 0 |
| The Better Angels of Machine Personality: How Personality Relates to LLM Safety | Jul 17, 2024 | Fairness, Safety Alignment | Code Available | 0 |
| Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture | Jul 10, 2024 | Safety Alignment | Unverified | 0 |
| Jailbreak Attacks and Defenses Against Large Language Models: A Survey | Jul 5, 2024 | Code Completion, Question Answering | Unverified | 0 |
| LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models | Jul 3, 2024 | Safety Alignment | Unverified | 0 |
| SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Jul 2, 2024 | Red Teaming, Safety Alignment | Code Available | 0 |
| SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance | Jun 26, 2024 | Safety Alignment | Code Available | 0 |