| The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis | Feb 13, 2025 | Safety Alignment | Code Available | 3 |
| Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey | Sep 26, 2024 | Safety Alignment | Code Available | 3 |
| The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs | Jul 15, 2025 | Code Generation, Safety Alignment | Code Available | 2 |
| PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks | May 20, 2025 | LLM Jailbreak, Safety Alignment | Code Available | 2 |
| Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction | May 16, 2025 | Contrastive Learning, Safety Alignment | Code Available | 2 |
| LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation | Apr 10, 2025 | Code Generation, Continual Learning | Code Available | 2 |
| STAIR: Improving Safety Alignment with Introspective Reasoning | Feb 4, 2025 | Safety Alignment | Code Available | 2 |
| Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation | Jan 29, 2025 | Red Teaming, Safety Alignment | Code Available | 2 |
| Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues | Oct 14, 2024 | LLM Jailbreak, Safety Alignment | Code Available | 2 |
| Cross-Modality Safety Alignment | Jun 21, 2024 | Safety Alignment | Code Available | 2 |
| Safety Alignment Should Be Made More Than Just a Few Tokens Deep | Jun 10, 2024 | Safety Alignment | Code Available | 2 |
| How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | Jun 9, 2024 | Safety Alignment | Code Available | 2 |
| AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs | Apr 11, 2024 | Safety Alignment | Code Available | 2 |
| CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion | Mar 12, 2024 | Code Completion, Safety Alignment | Code Available | 2 |
| DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers | Feb 25, 2024 | In-Context Learning, Safety Alignment | Code Available | 2 |
| Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning | Feb 21, 2024 | Instruction Following, Language Modeling | Code Available | 2 |
| ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | Feb 19, 2024 | Safety Alignment | Code Available | 2 |
| Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | Feb 3, 2024 | Instruction Following, Safety Alignment | Code Available | 2 |
| Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Oct 5, 2023 | Red Teaming, Safety Alignment | Code Available | 2 |
| GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Aug 12, 2023 | Ethics, Red Teaming | Code Available | 2 |
| Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models | Jun 19, 2025 | Large Language Model, Safety Alignment | Code Available | 1 |
| Probing the Robustness of Large Language Models Safety to Latent Perturbations | Jun 19, 2025 | Diagnostic, Safety Alignment | Code Available | 1 |
| DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | Jun 11, 2025 | Safety Alignment | Code Available | 1 |
| Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models | Jun 9, 2025 | Multi-agent Reinforcement Learning, Safety Alignment | Code Available | 1 |
| RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards | Jun 9, 2025 | Safety Alignment | Code Available | 1 |
| Lifelong Safety Alignment for Language Models | May 26, 2025 | Safety Alignment | Code Available | 1 |
| MPO: Multilingual Safety Alignment via Reward Gap Optimization | May 22, 2025 | Safety Alignment | Code Available | 1 |
| MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming | May 22, 2025 | Red Teaming, Safety Alignment | Code Available | 1 |
| Safety Subspaces are Not Distinct: A Fine-Tuning Case Study | May 20, 2025 | Safety Alignment | Code Available | 1 |
| Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models | May 20, 2025 | Safety Alignment | Code Available | 1 |
| AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender | Apr 13, 2025 | Safety Alignment | Code Available | 1 |
| VPO: Aligning Text-to-Video Generation Models with Prompt Optimization | Mar 26, 2025 | In-Context Learning, Safety Alignment | Code Available | 1 |
| sudo rm -rf agentic_security | Mar 26, 2025 | Adversarial Attack, AI and Safety | Code Available | 1 |
| LookAhead Tuning: Safer Language Models via Partial Answer Previews | Mar 24, 2025 | Position, Safety Alignment | Code Available | 1 |
| SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging | Mar 21, 2025 | GSM8K, Safety Alignment | Code Available | 1 |
| Improving LLM Safety Alignment with Dual-Objective Optimization | Mar 5, 2025 | Safety Alignment | Code Available | 1 |
| Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable | Mar 1, 2025 | Language Modeling | Code Available | 1 |
| Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks | Feb 28, 2025 | Safety Alignment | Code Available | 1 |
| Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models | Feb 16, 2025 | Safety Alignment | Code Available | 1 |
| X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability | Feb 14, 2025 | Safety Alignment | Code Available | 1 |
| QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language | Feb 13, 2025 | Safety Alignment | Code Available | 1 |
| Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions | Feb 6, 2025 | Safety Alignment | Code Available | 1 |
| Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation | Jan 30, 2025 | Safety Alignment | Code Available | 1 |
| xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking | Jan 28, 2025 | Reinforcement Learning (RL), Safety Alignment | Code Available | 1 |
| Autonomous Microscopy Experiments through Large Language Model Agents | Dec 18, 2024 | Benchmarking, Experimental Design | Code Available | 1 |
| PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage | Dec 7, 2024 | Red Teaming, Safety Alignment | Code Available | 1 |
| Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Nov 27, 2024 | Safety Alignment, Visual Reasoning | Code Available | 1 |
| SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types | Oct 29, 2024 | Language Modeling | Code Available | 1 |
| Bayesian scaling laws for in-context learning | Oct 21, 2024 | In-Context Learning, Safety Alignment | Code Available | 1 |
| Locking Down the Finetuned LLMs Safety | Oct 14, 2024 | Safety Alignment | Code Available | 1 |