| Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Oct 5, 2023 | AllLanguage Modeling | CodeCode Available | 1 |
| Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable | Mar 1, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models | May 20, 2025 | Safety Alignment | CodeCode Available | 1 |
| Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Feb 19, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| All Languages Matter: On the Multilingual Safety of Large Language Models | Oct 2, 2023 | AllSafety Alignment | CodeCode Available | 1 |
| Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment | Feb 22, 2024 | Backdoor AttackLanguage Modelling | CodeCode Available | 1 |
| MPO: Multilingual Safety Alignment via Reward Gap Optimization | May 22, 2025 | Safety Alignment | CodeCode Available | 1 |
| MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming | May 22, 2025 | Red TeamingSafety Alignment | CodeCode Available | 1 |
| Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation | Jan 30, 2025 | Safety Alignment | CodeCode Available | 1 |
| RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards | Jun 9, 2025 | Safety Alignment | CodeCode Available | 1 |