SOTAVerified

Safety Alignment

Papers

Showing 101–125 of 288 papers

Title | Status | Hype
Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning | - | 0
DiaBlo: Diagonal Blocks Are Sufficient For Finetuning | Code | 0
BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | Code | 0
Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models | - | 0
TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis | Code | 0
Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap | - | 0
SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models? | - | 0
AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models | Code | 0
EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions | - | 0
Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack | - | 0
OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models | Code | 0
SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge | - | 0
PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing | - | 0
SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety | - | 0
VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration | Code | 0
Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models | Code | 0
Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models | - | 0
Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? | - | 0
Safety Alignment via Constrained Knowledge Unlearning | - | 0
Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary | - | 0
One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | Code | 0
Alignment and Safety of Diffusion Models via Reinforcement Learning and Reward Modeling: A Survey | - | 0
CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning | - | 0
From Evaluation to Defense: Advancing Safety in Video Large Language Models | - | 0
Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization | Code | 0
Page 5 of 12

Leaderboard

No leaderboard results yet.