| Don't Say No: Jailbreaking LLM by Suppressing Refusal | Apr 25, 2024 | Natural Language Inference, Safety Alignment | Code Available | 1 | 5 |
| Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Oct 5, 2023 | Language Modeling | Code Available | 1 | 5 |
| Locking Down the Finetuned LLMs Safety | Oct 14, 2024 | Safety Alignment | Code Available | 1 | 5 |
| Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack | May 28, 2024 | Safety Alignment | Code Available | 1 | 5 |
| AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Oct 23, 2023 | Adversarial Attack, Blocking | Code Available | 1 | 5 |
| DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | Jun 11, 2025 | Safety Alignment | Code Available | 1 | 5 |
| Autonomous Microscopy Experiments through Large Language Model Agents | Dec 18, 2024 | Benchmarking, Experimental Design | Code Available | 1 | 5 |
| Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Nov 27, 2024 | Safety Alignment, Visual Reasoning | Code Available | 1 | 5 |
| Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Nov 15, 2023 | Red Teaming, Safety Alignment | Code Available | 1 | 5 |
| QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language | Feb 13, 2025 | Safety Alignment | Code Available | 1 | 5 |