| Title | Date | Tags | Code |
| --- | --- | --- | --- |
| X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability | Feb 14, 2025 | Safety Alignment | Code Available |
| Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks | Feb 28, 2025 | Safety Alignment | Code Available |
| Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models | Feb 16, 2025 | Safety Alignment | Code Available |
| Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations | Jun 17, 2024 | AI and Safety, Question Answering | Code Available |
| Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models | May 20, 2025 | Safety Alignment | Code Available |
| sudo rm -rf agentic_security | Mar 26, 2025 | Adversarial Attack, AI and Safety | Code Available |
| SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model | Jun 17, 2024 | Language Modeling | Code Available |
| SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models | Jun 18, 2024 | Safety Alignment | Code Available |
| Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Nov 15, 2023 | Red Teaming, Safety Alignment | Code Available |
| LookAhead Tuning: Safer Language Models via Partial Answer Previews | Mar 24, 2025 | Position, Safety Alignment | Code Available |
| SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset | Jun 20, 2024 | Safety Alignment, Text-to-Video Generation | Code Available |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and Safety, Diversity | Code Available |
| Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions | Feb 6, 2025 | Safety Alignment | Code Available |
| SuperHF: Supervised Iterative Learning from Human Feedback | Oct 25, 2023 | Language Modeling, Safety Alignment | Code Available |
| From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks | Jul 3, 2024 | Safety Alignment | Code Available |
| Improving LLM Safety Alignment with Dual-Objective Optimization | Mar 5, 2025 | Safety Alignment | Code Available |
| BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | Jul 10, 2023 | Question Answering, Safety Alignment | Code Available |
| Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates | Feb 28, 2024 | GSM8K, Safety Alignment | Code Available |
| Don't Say No: Jailbreaking LLM by Suppressing Refusal | Apr 25, 2024 | Natural Language Inference, Safety Alignment | Code Available |
| Safety Subspaces are Not Distinct: A Fine-Tuning Case Study | May 20, 2025 | Safety Alignment | Code Available |
| Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Oct 5, 2023 | All, Language Modeling | Code Available |
| Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack | May 28, 2024 | Safety Alignment | Code Available |
| Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space | Feb 14, 2024 | Adversarial Robustness, Safety Alignment | Code Available |
| Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Feb 19, 2024 | Language Modeling | Code Available |
| All Languages Matter: On the Multilingual Safety of Large Language Models | Oct 2, 2023 | All, Safety Alignment | Code Available |
| Lifelong Safety Alignment for Language Models | May 26, 2025 | Safety Alignment | Code Available |
| Locking Down the Finetuned LLMs Safety | Oct 14, 2024 | Safety Alignment | Code Available |
| Bayesian scaling laws for in-context learning | Oct 21, 2024 | In-Context Learning, Safety Alignment | Code Available |
| Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation | Jan 30, 2025 | Safety Alignment | Code Available |
| Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Nov 27, 2024 | Safety Alignment, Visual Reasoning | Code Available |
| Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable | Mar 1, 2025 | Language Modeling | Code Available |
| SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types | Oct 29, 2024 | Language Modeling | Code Available |
| Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation | Oct 13, 2024 | Safety Alignment, TAR | Code Available |
| MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Jan 5, 2024 | Safety Alignment | Code Available |
| Can Editing LLMs Inject Harm? | Jul 29, 2024 | Fairness, General Knowledge | Code Available |
| Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs | Jun 21, 2025 | Safety Alignment | Code Available |
| DiaBlo: Diagonal Blocks Are Sufficient For Finetuning | Jun 3, 2025 | Arithmetic Reasoning, Code Generation | Code Available |
| Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack | Dec 12, 2023 | Question Answering, Safety Alignment | Code Available |
| Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models | Oct 7, 2024 | Language Modeling | Code Available |
| SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance | Jun 26, 2024 | Safety Alignment | Code Available |
| Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Oct 31, 2024 | Red Teaming, Safety Alignment | Code Available |
| Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Jun 17, 2024 | 16k, Language Modeling | Code Available |
| A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement | Oct 17, 2024 | Language Modeling | Code Available |
| PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling | Feb 4, 2025 | Safety Alignment | Code Available |
| One-Shot Safety Alignment for Large Language Models via Optimal Dualization | May 29, 2024 | Safety Alignment | Code Available |
| AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models | May 29, 2025 | Safety Alignment | Code Available |
| One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | May 23, 2025 | All, Safety Alignment | Code Available |
| Overriding Safety protections of Open-source Models | Sep 28, 2024 | Red Teaming, Safety Alignment | Code Available |
| Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking | Feb 19, 2025 | Prompt Engineering, Safety Alignment | Code Available |
| Can a large language model be a gaslighter? | Oct 11, 2024 | Language Modeling | Code Available |