SOTAVerified

Safety Alignment

Papers

Showing 26–50 of 288 papers

Title | Status | Hype
Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap | — | 0
TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis | Code | 0
SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models? | — | 0
EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions | — | 0
AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models | Code | 0
Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack | — | 0
PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing | — | 0
SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge | — | 0
OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models | Code | 0
VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration | Code | 0
SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety | — | 0
Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models | Code | 0
Lifelong Safety Alignment for Language Models | Code | 1
Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models | — | 0
Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? | — | 0
Safety Alignment via Constrained Knowledge Unlearning | — | 0
Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary | — | 0
Alignment and Safety of Diffusion Models via Reinforcement Learning and Reward Modeling: A Survey | — | 0
One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | Code | 0
Shape it Up! Restoring LLM Safety during Finetuning | — | 0
MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming | Code | 1
From Evaluation to Defense: Advancing Safety in Video Large Language Models | — | 0
MPO: Multilingual Safety Alignment via Reward Gap Optimization | Code | 1
CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning | — | 0
DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection | Code | 0
Page 2 of 12

No leaderboard results yet.