| Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning | Jan 31, 2025 | Blocking, Safety Alignment | —Unverified | 0 | 0 |
| PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment | Nov 18, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| EnJa: Ensemble Jailbreak on Large Language Models | Aug 7, 2024 | Safety Alignment | —Unverified | 0 | 0 |
| Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine | Nov 20, 2024 | Fairness, Safety Alignment | —Unverified | 0 | 0 |
| Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models | Feb 17, 2025 | Safety Alignment | —Unverified | 0 | 0 |
| ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization | Apr 3, 2025 | Safety Alignment | —Unverified | 0 | 0 |
| EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions | May 29, 2025 | Safety Alignment | —Unverified | 0 | 0 |
| Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models | Nov 27, 2024 | Image Generation, Safety Alignment | —Unverified | 0 | 0 |
| FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning | May 12, 2025 | 16k, Benchmarking | —Unverified | 0 | 0 |
| FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts | Feb 28, 2025 | Safety Alignment | —Unverified | 0 | 0 |