| PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models | Jan 7, 2025 | Image Generation, Safety Alignment | —Unverified | 0 |
| RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability | Apr 14, 2025 | Safety Alignment | —Unverified | 0 |
| Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions | Feb 8, 2025 | Safety Alignment | —Unverified | 0 |
| Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation | Jun 9, 2025 | Safety Alignment | —Unverified | 0 |
| Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model | Mar 13, 2025 | Language Modeling, Language Modelling | —Unverified | 0 |
| Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models | May 26, 2025 | Safety Alignment | —Unverified | 0 |
| Robustifying Safety-Aligned Large Language Models through Clean Data Curation | May 24, 2024 | Safety Alignment | —Unverified | 0 |
| SafeArena: Evaluating the Safety of Autonomous Web Agents | Mar 6, 2025 | Misinformation, Safety Alignment | —Unverified | 0 |
| SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models? | May 29, 2025 | Diagnostic, Red Teaming | —Unverified | 0 |
| SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety | May 26, 2025 | Language Modeling, Language Modelling | —Unverified | 0 |