| Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars | Dec 10, 2024 | Safety Alignment | Unverified | 0 |
| SafeWorld: Geo-Diverse Safety Alignment | Dec 9, 2024 | Safety Alignment | Code Available | 0 |
| PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage | Dec 7, 2024 | Red Teaming, Safety Alignment | Code Available | 1 |
| Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models | Nov 30, 2024 | Safety Alignment | Unverified | 0 |
| PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning | Nov 28, 2024 | Federated Learning, Parameter-Efficient Fine-Tuning | Unverified | 0 |
| Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Nov 27, 2024 | Safety Alignment, Visual Reasoning | Code Available | 1 |
| Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models | Nov 27, 2024 | Image Generation, Safety Alignment | Unverified | 0 |
| Don't Command, Cultivate: An Exploratory Study of System-2 Alignment | Nov 26, 2024 | Prompt Engineering, Safety Alignment | Code Available | 0 |
| Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine | Nov 20, 2024 | Fairness, Safety Alignment | Unverified | 0 |
| PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment | Nov 18, 2024 | Language Modeling | Unverified | 0 |
| Playing Language Game with LLMs Leads to Jailbreaking | Nov 16, 2024 | Safety Alignment | Unverified | 0 |
| Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models | Nov 6, 2024 | Safety Alignment | Unverified | 0 |
| Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment | Nov 5, 2024 | Quantization, Safety Alignment | Code Available | 0 |
| Code-Switching Curriculum Learning for Multilingual Transfer in LLMs | Nov 4, 2024 | Cross-Lingual Transfer, Language Acquisition | Unverified | 0 |
| Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Oct 31, 2024 | Red Teaming, Safety Alignment | Code Available | 0 |
| Smaller Large Language Models Can Do Moral Self-Correction | Oct 30, 2024 | Language Modeling | Unverified | 0 |
| SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types | Oct 29, 2024 | Language Modeling | Code Available | 1 |
| Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization | Oct 25, 2024 | Safety Alignment | Code Available | 0 |
| Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities | Oct 24, 2024 | Safety Alignment | Code Available | 0 |
| Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks | Oct 23, 2024 | Instruction Following, Safety Alignment | Unverified | 0 |
| Bayesian scaling laws for in-context learning | Oct 21, 2024 | In-Context Learning, Safety Alignment | Code Available | 1 |
| BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models | Oct 17, 2024 | Red Teaming, Safety Alignment | Code Available | 0 |
| A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement | Oct 17, 2024 | Language Modeling | Code Available | 0 |
| SPIN: Self-Supervised Prompt INjection | Oct 17, 2024 | Safety Alignment | Unverified | 0 |
| Locking Down the Finetuned LLMs Safety | Oct 14, 2024 | Safety Alignment | Code Available | 1 |
| Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues | Oct 14, 2024 | LLM Jailbreak, Safety Alignment | Code Available | 2 |
| Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation | Oct 13, 2024 | Safety Alignment, TAR | Code Available | 1 |
| Can a large language model be a gaslighter? | Oct 11, 2024 | Language Modeling | Code Available | 0 |
| AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation | Oct 11, 2024 | Safety Alignment | Code Available | 1 |
| Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models | Oct 11, 2024 | Safety Alignment | Unverified | 0 |
| Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements | Oct 11, 2024 | Safety Alignment | Unverified | 0 |
| Superficial Safety Alignment Hypothesis | Oct 7, 2024 | Attribute, Binary Classification | Unverified | 0 |
| Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models | Oct 7, 2024 | Language Modeling | Code Available | 0 |
| Toxic Subword Pruning for Dialogue Response Generation on Large Language Models | Oct 5, 2024 | Language Modeling, Machine Translation | Unverified | 0 |
| LLM Safeguard is a Double-Edged Sword: Exploiting False Positives for Denial-of-Service Attacks | Oct 3, 2024 | Adversarial Robustness, Safety Alignment | Unverified | 0 |
| SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks | Oct 2, 2024 | Safety Alignment | Unverified | 0 |
| Towards Inference-time Category-wise Safety Steering for Large Language Models | Oct 2, 2024 | Safety Alignment | Unverified | 0 |
| Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Oct 1, 2024 | Safety Alignment | Code Available | 0 |
| Overriding Safety protections of Open-source Models | Sep 28, 2024 | Red Teaming, Safety Alignment | Code Available | 0 |
| Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey | Sep 26, 2024 | Safety Alignment | Code Available | 3 |
| Backtracking Improves Generation Safety | Sep 22, 2024 | Language Modeling | Unverified | 0 |
| PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach | Sep 21, 2024 | Multi-agent Reinforcement Learning, Safety Alignment | Unverified | 0 |
| Mitigating Unsafe Feedback with Learning Constraints | Sep 19, 2024 | Safety Alignment, Text Generation | Unverified | 0 |
| SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering | Aug 21, 2024 | Safety Alignment | Code Available | 1 |
| Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer | Aug 21, 2024 | Safety Alignment | Code Available | 0 |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and Safety, Diversity | Code Available | 1 |
| Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation | Aug 20, 2024 | Safety Alignment | Unverified | 0 |
| Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning | Aug 18, 2024 | Philosophy, Safety Alignment | Code Available | 1 |
| SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | Aug 14, 2024 | Red Teaming, Safety Alignment | Unverified | 0 |
| Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions | Aug 14, 2024 | Safety Alignment | Code Available | 0 |