SOTAVerified

Safety Alignment

Papers

Showing 101–150 of 288 papers

Title | Status | Hype
Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning | - | 0
DiaBlo: Diagonal Blocks Are Sufficient For Finetuning | Code | 0
BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | Code | 0
Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models | - | 0
Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap | - | 0
TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis | Code | 0
SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models? | - | 0
AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models | Code | 0
EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions | - | 0
Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack | - | 0
OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models | Code | 0
PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing | - | 0
SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge | - | 0
VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration | Code | 0
Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models | - | 0
SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety | - | 0
Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models | Code | 0
Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? | - | 0
Safety Alignment via Constrained Knowledge Unlearning | - | 0
Alignment and Safety of Diffusion Models via Reinforcement Learning and Reward Modeling: A Survey | - | 0
Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary | - | 0
One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | Code | 0
DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection | Code | 0
Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization | Code | 0
From Evaluation to Defense: Advancing Safety in Video Large Language Models | - | 0
CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning | - | 0
Shape it Up! Restoring LLM Safety during Finetuning | - | 0
Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | Code | 0
"Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | - | 0
SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment | - | 0
sudoLLM : On Multi-role Alignment of Language Models | - | 0
Safety Alignment Can Be Not Superficial With Explicit Safety Signals | - | 0
SafeVid: Toward Safety Aligned Video Large Multimodal Models | - | 0
JULI: Jailbreak Large Language Models by Self-Introspection | - | 0
Noise Injection Systemically Degrades Large Language Model Safety Guardrails | - | 0
CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs | - | 0
Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data | - | 0
FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning | - | 0
One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models | - | 0
Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety | Code | 0
Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model | Code | 0
NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models | - | 0
SAGE: A Generic Framework for LLM Safety Evaluation | Code | 0
What's Pulling the Strings? Evaluating Integrity and Attribution in AI Training and Inference through Concept Shift | - | 0
AI Awareness | - | 0
DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models | Code | 0
aiXamine: Simplified LLM Safety and Security | - | 0
Towards NSFW-Free Text-to-Image Generation via Safety-Constraint Direct Preference Optimization | - | 0
Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models | - | 0
VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization | - | 0
Page 3 of 6

Leaderboard

No leaderboard results yet.