| Deceptive Alignment Monitoring | Jul 20, 2023 | Safety Alignment | —Unverified | 0 | 0 |
| Mitigating Unsafe Feedback with Learning Constraints | Sep 19, 2024 | Safety Alignment, Text Generation | —Unverified | 0 | 0 |
| DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing | Feb 17, 2025 | Decision Making, Language Modeling | —Unverified | 0 | 0 |
| Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? | May 24, 2025 | Code Generation, Math | —Unverified | 0 | 0 |
| Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning | Jun 17, 2025 | Language Modeling | —Unverified | 0 | 0 |
| Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? | Apr 14, 2025 | Safety Alignment | —Unverified | 0 | 0 |
| Enhancing Jailbreak Attacks with Diversity Guidance | Mar 1, 2024 | Diversity, Language Modeling | —Unverified | 0 | 0 |
| Effectively Controlling Reasoning Models through Thinking Intervention | Mar 31, 2025 | Instruction Following, Safety Alignment | —Unverified | 0 | 0 |
| Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models | Jun 15, 2024 | Federated Learning, Language Modeling | —Unverified | 0 | 0 |
| Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens | May 31, 2024 | Safety Alignment | —Unverified | 0 | 0 |