| Title | Date | Tags |
| --- | --- | --- |
| Mitigating Unsafe Feedback with Learning Constraints | Sep 19, 2024 | Safety Alignment, Text Generation |
| Deceptive Alignment Monitoring | Jul 20, 2023 | Safety Alignment |
| aiXamine: Simplified LLM Safety and Security | Apr 21, 2025 | Adversarial Robustness |
| LLM-Safety Evaluations Lack Robustness | Mar 4, 2025 | Red Teaming, Response Generation |
| CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning | May 22, 2025 | Language Modeling |
| AI Awareness | Apr 25, 2025 | Safety Alignment |
| AI Alignment at Your Discretion | Feb 10, 2025 | Safety Alignment |
| Cross-Modal Safety Alignment: Is textual unlearning all you need? | May 27, 2024 | Safety Alignment |
| CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs | May 16, 2025 | Adversarial Robustness, Safety Alignment |
| Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements | Oct 11, 2024 | Safety Alignment |