| Title | Date | Tasks | Code | Stars |
| --- | --- | --- | --- | --- |
| Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning | Jun 4, 2025 | Safety Alignment | Unverified | 0 |
| DiaBlo: Diagonal Blocks Are Sufficient For Finetuning | Jun 3, 2025 | Arithmetic Reasoning, Code Generation | Code Available | 0 |
| BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | Jun 3, 2025 | Prompt Engineering, Red Teaming | Code Available | 0 |
| Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models | Jun 2, 2025 | Safety Alignment | Unverified | 0 |
| Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap | May 30, 2025 | Safety Alignment | Unverified | 0 |
| TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis | May 30, 2025 | Diversity, Language Modeling | Code Available | 0 |
| SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models? | May 29, 2025 | Diagnostic, Red Teaming | Unverified | 0 |
| AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models | May 29, 2025 | Safety Alignment | Code Available | 0 |
| EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions | May 29, 2025 | Safety Alignment | Unverified | 0 |
| Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack | May 28, 2025 | Adversarial Attack, Safety Alignment | Unverified | 0 |
| OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models | May 27, 2025 | Safety Alignment | Code Available | 0 |
| PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing | May 27, 2025 | Counterfactual, Diversity | Unverified | 0 |
| SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge | May 27, 2025 | Benchmarking, Multiple-choice | Unverified | 0 |
| VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration | May 26, 2025 | Language Modeling | Code Available | 0 |
| Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models | May 26, 2025 | Safety Alignment | Unverified | 0 |
| SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety | May 26, 2025 | Language Modeling | Unverified | 0 |
| Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models | May 26, 2025 | Safety Alignment | Code Available | 0 |
| Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? | May 24, 2025 | Code Generation, Math | Unverified | 0 |
| Safety Alignment via Constrained Knowledge Unlearning | May 24, 2025 | Knowledge Editing, Safety Alignment | Unverified | 0 |
| Alignment and Safety of Diffusion Models via Reinforcement Learning and Reward Modeling: A Survey | May 23, 2025 | Active Learning, Reinforcement Learning (RL) | Unverified | 0 |
| Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary | May 23, 2025 | Safety Alignment | Unverified | 0 |
| One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | May 23, 2025 | All, Safety Alignment | Code Available | 0 |
| DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection | May 22, 2025 | Quantization, Safety Alignment | Code Available | 0 |
| Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization | May 22, 2025 | Safety Alignment | Code Available | 0 |
| From Evaluation to Defense: Advancing Safety in Video Large Language Models | May 22, 2025 | Safety Alignment | Unverified | 0 |
| CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning | May 22, 2025 | Language Modeling | Unverified | 0 |
| Shape it Up! Restoring LLM Safety during Finetuning | May 22, 2025 | Safety Alignment | Unverified | 0 |
| Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | May 21, 2025 | Benchmarking, Language Modeling | Code Available | 0 |
| "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | May 20, 2025 | Image Generation, Red Teaming | Unverified | 0 |
| SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment | May 20, 2025 | Safety Alignment | Unverified | 0 |
| sudoLLM: On Multi-role Alignment of Language Models | May 20, 2025 | Language Modeling | Unverified | 0 |
| Safety Alignment Can Be Not Superficial With Explicit Safety Signals | May 19, 2025 | Binary Classification, Data Augmentation | Unverified | 0 |
| SafeVid: Toward Safety Aligned Video Large Multimodal Models | May 17, 2025 | Safety Alignment | Unverified | 0 |
| JULI: Jailbreak Large Language Models by Self-Introspection | May 17, 2025 | Safety Alignment | Unverified | 0 |
| Noise Injection Systemically Degrades Large Language Model Safety Guardrails | May 16, 2025 | Language Modeling | Unverified | 0 |
| CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs | May 16, 2025 | Adversarial Robustness, Safety Alignment | Unverified | 0 |
| Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data | May 15, 2025 | Malware Detection, Safety Alignment | Unverified | 0 |
| FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning | May 12, 2025 | 16k, Benchmarking | Unverified | 0 |
| One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models | May 12, 2025 | Code Generation, Safety Alignment | Unverified | 0 |
| Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety | May 11, 2025 | Outlier Detection, Red Teaming | Code Available | 0 |
| Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model | May 10, 2025 | Safety Alignment | Code Available | 0 |
| NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models | Apr 29, 2025 | Safety Alignment | Unverified | 0 |
| SAGE: A Generic Framework for LLM Safety Evaluation | Apr 28, 2025 | Red Teaming, Safety Alignment | Code Available | 0 |
| What's Pulling the Strings? Evaluating Integrity and Attribution in AI Training and Inference through Concept Shift | Apr 28, 2025 | Attribute, Data Poisoning | Unverified | 0 |
| AI Awareness | Apr 25, 2025 | Safety Alignment | Unverified | 0 |
| DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models | Apr 25, 2025 | Disentanglement, Safety Alignment | Code Available | 0 |
| aiXamine: Simplified LLM Safety and Security | Apr 21, 2025 | 2k, Adversarial Robustness | Unverified | 0 |
| Towards NSFW-Free Text-to-Image Generation via Safety-Constraint Direct Preference Optimization | Apr 19, 2025 | Contrastive Learning, Image Generation | Unverified | 0 |
| Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models | Apr 18, 2025 | Safety Alignment | Unverified | 0 |
| VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization | Apr 17, 2025 | Multimodal Reasoning, Safety Alignment | Unverified | 0 |