| Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack | May 28, 2025 | Adversarial AttackSafety Alignment | —Unverified | 0 |
| PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing | May 27, 2025 | counterfactualDiversity | —Unverified | 0 |
| SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge | May 27, 2025 | BenchmarkingMultiple-choice | —Unverified | 0 |
| OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models | May 27, 2025 | Safety Alignment | CodeCode Available | 0 |
| VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration | May 26, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 0 |
| SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety | May 26, 2025 | Language ModelingLanguage Modelling | —Unverified | 0 |
| Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models | May 26, 2025 | Safety Alignment | CodeCode Available | 0 |
| Lifelong Safety Alignment for Language Models | May 26, 2025 | Safety Alignment | CodeCode Available | 1 |
| Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models | May 26, 2025 | Safety Alignment | —Unverified | 0 |
| Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? | May 24, 2025 | Code GenerationMath | —Unverified | 0 |