| Title | Date | Tasks | Code | Stars |
|---|---|---|---|---|
| PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing | May 27, 2025 | Counterfactual, Diversity | Unverified | 0 |
| SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge | May 27, 2025 | Benchmarking, Multiple-choice | Unverified | 0 |
| OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models | May 27, 2025 | Safety Alignment | Code Available | 0 |
| SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety | May 26, 2025 | Language Modeling | Unverified | 0 |
| Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models | May 26, 2025 | Safety Alignment | Code Available | 0 |
| Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models | May 26, 2025 | Safety Alignment | Unverified | 0 |
| VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration | May 26, 2025 | Language Modeling | Code Available | 0 |
| Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? | May 24, 2025 | Code Generation, Math | Unverified | 0 |
| Safety Alignment via Constrained Knowledge Unlearning | May 24, 2025 | Knowledge Editing, Safety Alignment | Unverified | 0 |
| Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary | May 23, 2025 | Safety Alignment | Unverified | 0 |