| Locking Down the Finetuned LLMs Safety | Oct 14, 2024 | Safety Alignment | Code Available | 1 |
| PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage | Dec 7, 2024 | Red Teaming, Safety Alignment | Code Available | 1 |
| Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models | Jun 19, 2025 | Large Language Model, Safety Alignment | Code Available | 1 |
| MPO: Multilingual Safety Alignment via Reward Gap Optimization | May 22, 2025 | Safety Alignment | Code Available | 1 |
| Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates | Feb 28, 2024 | GSM8K, Safety Alignment | Code Available | 1 |
| RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards | Jun 9, 2025 | Safety Alignment | Code Available | 1 |
| SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models | Jun 18, 2024 | Safety Alignment | Code Available | 1 |
| Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment | Aug 18, 2023 | MMLU, Red Teaming | Code Available | 1 |
| SuperHF: Supervised Iterative Learning from Human Feedback | Oct 25, 2023 | Language Modelling, Safety Alignment | Code Available | 1 |
| Can Editing LLMs Inject Harm? | Jul 29, 2024 | Fairness, General Knowledge | Code Available | 1 |
| Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? | May 24, 2025 | Code Generation, Math | Unverified | 0 |
| Backtracking for Safety | Mar 11, 2025 | Safety Alignment | Unverified | 0 |
| Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification | Mar 14, 2025 | Safety Alignment | Unverified | 0 |
| DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing | Feb 17, 2025 | Decision Making, Language Modeling | Unverified | 0 |
| LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models | Apr 14, 2025 | Persuasion Strategies, Safety Alignment | Unverified | 0 |
| Mitigating Unsafe Feedback with Learning Constraints | Sep 19, 2024 | Safety Alignment, Text Generation | Unverified | 0 |
| Deceptive Alignment Monitoring | Jul 20, 2023 | Safety Alignment | Unverified | 0 |
| aiXamine: Simplified LLM Safety and Security | Apr 21, 2025 | 2k, Adversarial Robustness | Unverified | 0 |
| LLM-Safety Evaluations Lack Robustness | Mar 4, 2025 | Red Teaming, Response Generation | Unverified | 0 |
| CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning | May 22, 2025 | Language Modeling | Unverified | 0 |
| AI Awareness | Apr 25, 2025 | Safety Alignment | Unverified | 0 |
| AI Alignment at Your Discretion | Feb 10, 2025 | Safety Alignment | Unverified | 0 |
| Cross-Modal Safety Alignment: Is textual unlearning all you need? | May 27, 2024 | Safety Alignment | Unverified | 0 |
| CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs | May 16, 2025 | Adversarial Robustness, Safety Alignment | Unverified | 0 |
| Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements | Oct 11, 2024 | Safety Alignment | Unverified | 0 |