SOTAVerified

Safety Alignment

Papers

Showing 201–250 of 288 papers

Title | Status | Hype
Toxic Subword Pruning for Dialogue Response Generation on Large Language Models | - | 0
Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message | - | 0
Trustworthy AI: Safety, Bias, and Privacy -- A Survey | - | 0
Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models | - | 0
TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data | - | 0
Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary | - | 0
Understanding and Rectifying Safety Perception Distortion in VLMs | - | 0
Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models | - | 0
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models | - | 0
Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs | - | 0
VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization | - | 0
VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap | - | 0
Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning | - | 0
Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing | - | 0
What's Pulling the Strings? Evaluating Integrity and Attribution in AI Training and Inference through Concept Shift | - | 0
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models | - | 0
Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets | - | 0
Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region | - | 0
WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response | - | 0
Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation | - | 0
PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models | - | 0
RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability | - | 0
Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions | - | 0
Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation | - | 0
Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model | - | 0
Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models | - | 0
Robustifying Safety-Aligned Large Language Models through Clean Data Curation | - | 0
SafeArena: Evaluating the Safety of Autonomous Web Agents | - | 0
SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models? | - | 0
SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety | - | 0
LLM Safeguard is a Double-Edged Sword: Exploiting False Positives for Denial-of-Service Attacks | - | 0
SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models | - | 0
SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment | - | 0
Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models | - | 0
Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models | - | 0
Safety Alignment Can Be Not Superficial With Explicit Safety Signals | - | 0
Safety Alignment for Vision Language Models | - | 0
Safety Alignment via Constrained Knowledge Unlearning | - | 0
SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation | - | 0
PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling | Code | 0
Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions | Code | 0
Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking | Code | 0
OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models | Code | 0
Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models | Code | 0
Overriding Safety protections of Open-source Models | Code | 0
One-Shot Safety Alignment for Large Language Models via Optimal Dualization | Code | 0
Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge | Code | 0
BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | Code | 0
AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models | Code | 0
Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model | Code | 0
Page 5 of 6
