SOTAVerified

Safety Alignment

Papers

Showing 1–50 of 288 papers

| Title | Status | Hype |
| --- | --- | --- |
| The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs | Code | 2 |
| TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data | | 0 |
| Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message | | 0 |
| Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning | | 0 |
| Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks | | 0 |
| Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs | Code | 0 |
| SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification | | 0 |
| Probing the Robustness of Large Language Models Safety to Latent Perturbations | Code | 1 |
| Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models | Code | 1 |
| Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning | | 0 |
| Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs | Code | 0 |
| SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression | | 0 |
| Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors | Code | 0 |
| DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | Code | 1 |
| From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring | | 0 |
| AdversariaL attacK sAfety aLIgnment (ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement - Introducing Adversarial Vulnerability Quality Index (AVQI) | | 0 |
| Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation | | 0 |
| RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards | Code | 1 |
| Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models | Code | 1 |
| From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | | 0 |
| Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets | | 0 |
| Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning | | 0 |
| DiaBlo: Diagonal Blocks Are Sufficient For Finetuning | Code | 0 |
| BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | Code | 0 |
| Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models | | 0 |
| Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap | | 0 |
| TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis | Code | 0 |
| SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models? | | 0 |
| EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions | | 0 |
| AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models | Code | 0 |
| Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack | | 0 |
| PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing | | 0 |
| SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge | | 0 |
| OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models | Code | 0 |
| VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration | Code | 0 |
| SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety | | 0 |
| Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models | Code | 0 |
| Lifelong Safety Alignment for Language Models | Code | 1 |
| Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models | | 0 |
| Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? | | 0 |
| Safety Alignment via Constrained Knowledge Unlearning | | 0 |
| Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary | | 0 |
| Alignment and Safety of Diffusion Models via Reinforcement Learning and Reward Modeling: A Survey | | 0 |
| One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | Code | 0 |
| Shape it Up! Restoring LLM Safety during Finetuning | | 0 |
| MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming | Code | 1 |
| From Evaluation to Defense: Advancing Safety in Video Large Language Models | | 0 |
| MPO: Multilingual Safety Alignment via Reward Gap Optimization | Code | 1 |
| CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning | | 0 |
| DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection | Code | 0 |
Page 1 of 6

Leaderboard

No leaderboard results yet.