| Title | Date | Topics |
| --- | --- | --- |
| From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | Jun 7, 2025 | ARC, MMLU |
| Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data | May 15, 2025 | Malware Detection, Safety Alignment |
| EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions | May 29, 2025 | Safety Alignment |
| ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization | Apr 3, 2025 | Safety Alignment |
| On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept | Jun 4, 2024 | Question Answering, Safety Alignment |
| C3AI: Crafting and Evaluating Constitutions for Constitutional AI | Feb 21, 2025 | Safety Alignment |
| Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models | Feb 17, 2025 | Safety Alignment |
| Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement | Feb 23, 2024 | Safety Alignment |
| Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine | Nov 20, 2024 | Fairness, Safety Alignment |
| EnJa: Ensemble Jailbreak on Large Language Models | Aug 7, 2024 | Safety Alignment |