SOTAVerified

Safety Alignment

Papers

Showing 1–25 of 288 papers

Title | Status | Hype
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs | Code | 2
TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data | | 0
Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message | | 0
Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning | | 0
Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks | | 0
Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs | Code | 0
SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification | | 0
Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models | Code | 1
Probing the Robustness of Large Language Models Safety to Latent Perturbations | Code | 1
Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning | | 0
Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs | Code | 0
SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression | | 0
Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors | Code | 0
From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring | | 0
DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | Code | 1
AdversariaL attacK sAfety aLIgnment (ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement - Introducing Adversarial Vulnerability Quality Index (AVQI) | | 0
Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation | | 0
RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards | Code | 1
Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models | Code | 1
From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | | 0
Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets | | 0
Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning | | 0
DiaBlo: Diagonal Blocks Are Sufficient For Finetuning | Code | 0
BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | Code | 0
Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models | | 0
Page 1 of 12
