| Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models | May 27, 2024 | Safety Alignment | Code Available |
| MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Jan 5, 2024 | Safety Alignment | Code Available |
| Autonomous Microscopy Experiments through Large Language Model Agents | Dec 18, 2024 | Benchmarking, Experimental Design | Code Available |
| Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Nov 15, 2023 | Red Teaming, Safety Alignment | Code Available |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and Safety, Diversity | Code Available |
| Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Feb 19, 2024 | Language Modeling | Code Available |
| MPO: Multilingual Safety Alignment via Reward Gap Optimization | May 22, 2025 | Safety Alignment | Code Available |
| SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering | Aug 21, 2024 | Safety Alignment | Code Available |
| OR-Bench: An Over-Refusal Benchmark for Large Language Models | May 31, 2024 | Safety Alignment | Code Available |
| Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models | Jun 9, 2025 | Multi-agent Reinforcement Learning, Safety Alignment | Code Available |
| Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning | Aug 18, 2024 | Philosophy, Safety Alignment | Code Available |
| LookAhead Tuning: Safer Language Models via Partial Answer Previews | Mar 24, 2025 | Position, Safety Alignment | Code Available |
| Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | Jul 31, 2024 | Safety Alignment | Code Available |
| Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models | May 20, 2025 | Safety Alignment | Code Available |
| Lifelong Safety Alignment for Language Models | May 26, 2025 | Safety Alignment | Code Available |
| Locking Down the Finetuned LLMs Safety | Oct 14, 2024 | Safety Alignment | Code Available |
| Can Editing LLMs Inject Harm? | Jul 29, 2024 | Fairness, General Knowledge | Code Available |
| QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language | Feb 13, 2025 | Safety Alignment | Code Available |
| Don't Say No: Jailbreaking LLM by Suppressing Refusal | Apr 25, 2024 | Natural Language Inference, Safety Alignment | Code Available |
| Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Nov 27, 2024 | Safety Alignment, Visual Reasoning | Code Available |
| ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates | Jun 17, 2024 | Instruction Following, Safety Alignment | Code Available |
| All Languages Matter: On the Multilingual Safety of Large Language Models | Oct 2, 2023 | Safety Alignment | Code Available |
| Improving LLM Safety Alignment with Dual-Objective Optimization | Mar 5, 2025 | Safety Alignment | Code Available |
| DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | Jun 11, 2025 | Safety Alignment | Code Available |
| Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Oct 5, 2023 | Language Modeling | Code Available |