| Title | Date | Tags |
| --- | --- | --- |
| Code-Switching Curriculum Learning for Multilingual Transfer in LLMs | Nov 4, 2024 | Cross-Lingual Transfer, Language Acquisition |
| Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking | Nov 16, 2023 | Safety Alignment |
| Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements | Oct 11, 2024 | Safety Alignment |
| Cross-Modal Safety Alignment: Is textual unlearning all you need? | May 27, 2024 | Safety Alignment |
| CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning | May 22, 2025 | Language Modeling |
| Deceptive Alignment Monitoring | Jul 20, 2023 | Safety Alignment |
| Mitigating Unsafe Feedback with Learning Constraints | Sep 19, 2024 | Safety Alignment, Text Generation |
| DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing | Feb 17, 2025 | Decision Making, Language Modeling |
| Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? | May 24, 2025 | Code Generation, Math |
| Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning | Jun 17, 2025 | Language Modeling |
| Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? | Apr 14, 2025 | Safety Alignment |
| Enhancing Jailbreak Attacks with Diversity Guidance | Mar 1, 2024 | Diversity, Language Modeling |
| Effectively Controlling Reasoning Models through Thinking Intervention | Mar 31, 2025 | Instruction Following, Safety Alignment |
| Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models | Jun 15, 2024 | Federated Learning, Language Modeling |
| Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens | May 31, 2024 | Safety Alignment |
| Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning | Jan 31, 2025 | Blocking, Safety Alignment |
| PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment | Nov 18, 2024 | Language Modeling |
| EnJa: Ensemble Jailbreak on Large Language Models | Aug 7, 2024 | Safety Alignment |
| Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine | Nov 20, 2024 | Fairness, Safety Alignment |
| Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models | Feb 17, 2025 | Safety Alignment |
| ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization | Apr 3, 2025 | Safety Alignment |
| EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions | May 29, 2025 | Safety Alignment |
| Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models | Nov 27, 2024 | Image Generation, Safety Alignment |
| FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning | May 12, 2025 | 16k, Benchmarking |
| FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts | Feb 28, 2025 | Safety Alignment |